SIGIR ’23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
SESSION: Keynote Talks
Historically, information retrieval systems have all followed the same paradigm: information seekers frame their needs in the form of a short query, the system selects a small set of relevant results from a corpus of available documents, rank-orders the results by decreasing relevance, possibly excerpts a responsive passage for each result, and returns a list of references and excerpts to the user. Retrieval systems typically did not attempt fusing information from multiple documents into an answer and displaying that answer directly. This was largely due to available technology: at the core of each retrieval system is an index that maps lexical tokens or semantic embeddings to document identifiers. Indices are designed for retrieving responsive documents; they do not support integrating these documents into a holistic answer.
More recently, the coming-of-age of deep neural networks has dramatically improved the capabilities of large language models (LLMs). Trained on a large corpus of documents, these models not only memorize the vocabulary, morphology and syntax of human languages, but have shown to be able to memorize facts and relations. Generative language models, when provided with a prompt, will extend the prompt with likely completions — an ability that can be used to extract answers to questions from the model. Two years ago, Metzler et al. argued that this ability of LLMs will allow us to rethink the search paradigm: to answer information needs directly rather that directing users to responsive primary sources. Their vision was not without controversy; the following year Shaw and Bender argued that such a system is neither feasible nor desirable. Nonetheless, the past year has seen the emergence of such systems, with offerings from established search engines and multiple new entrants to the industry.
The keynote will summarize the short history of these generative information retrieval systems, and focus on the many open challenges in this emerging field: ensuring that answers are grounded, attributing answer passages to a primary source, providing nuanced answers to non-factoid-seeking questions, avoiding bias, and going beyond simple regurgitation of memorized facts. It will also touch on the changing nature of the content ecosystem. LLMs are starting to be used to generate web content. Should search engines treat such derived content equal to human-authored content? Is it possible to distinguish generated from original content? How should we view hybrid authorship where humans contribute ideas and LLMs shape these ideas into prose? And how will this parallel technical evolution of search engines and content ecosystems affect their respective business models?
Machine learning is everywhere, but unfortunately, we are not experts of every method. Sometimes we “inappropriately” use machine learning techniques. Examples include reporting training instead of test performance and comparing two methods without suitable hyper-parameter searches. However, the reality is that there are more sophisticated or more subtle examples, which we broadly call the “rough use” of machine learning techniques. The setting may be roughly fine, but seriously speaking, is inappropriate. We briefly discuss two intriguing examples.
– In the topic of graph representation learning, to evaluate the quality of the obtained representations, the multi-label problem of node classification is often considered. An unrealistic setting was used in almost the entire area by assuming that the number of labels of each test instance is known in the prediction stage. In practice, such ground truth information is rarely available. Details of this interesting story are in Lin et al. (2021).
– In training deep neural networks, the optimization process often relies on the validation performance for termination or selecting the best epoch. Thus in many public repositories, training, validation, and test sets are explicitly provided. Many think this setting is standard in applying any machine learning technique. However, except that the test set should be completely independent, users can do whatever the best setting on all the available labeled data (i.e., training and validation sets combined). Through real stories, we show that many did not clearly see the relation between training, validation, and test sets.
The rough use of machine learning methods is common and sometimes unavoidable. The reason is that nothing is called a perfect use of a machine learning method. Further, it is not easy to assess the seriousness of the situation. We argue that having high-quality and easy-to-use software is an important way to improve the practical use of machine learning techniques.
Digital user experiences are a mainstay of modern communication and commerce; multi-billion dollar industries have arisen around optimizing digital design. Usage analytics and A/B testing solutions allow growth hackers to quantitatively compute conversion over key user journeys, while user experience (UX) testing platforms enable UX researchers to qualitatively analyze usability and brand perception. Although these workflows are in pursuit of the same objective – producing better UX – the gulf between quantitative and qualitative testing is wide: they involve different stakeholders, and rely on disparate methodologies, budget, data streams, and software tools. This gap belies the opportunity to create a single platform that optimizes digital experiences holistically: using quantitative methods to uncover what and how much and qualitative analysis to understand why.
Such a platform could monitor conversion funnels, identify anomalous behaviors, intercept live users exhibiting those behaviors, and solicit explicit feedback in situ. This feedback could take many forms: survey responses, screen recordings of participants performing tasks, think-aloud audio, and more. By combining data from multiple users and correlating across feedback types, the platform could surface not just insights that a particular conversion funnel had been affected, but hypotheses about what had caused the change in user behavior. The platform could then rank these insights by how often the observed behavior occurred in the wild, using large-scale analytics to contextualize the results from small-scale UX tests.
To this end, a decade of research has focused on interaction mining: a set of techniques for capturing interaction and design data from digital artifacts, and aggregating these multimodal data streams into structured representations bridging quantitative and qualitative experience testing[1-4]. During user sessions, interaction mining systems record user interactions (e.g., clicks, scrolls, text input), screen captures, and render-time data structures (e.g., website DOMs, native app view hierarchies). Once captured, these data streams are aligned and combined into user traces, sequences of user interactions semanticized by the design data of their UI targets . The structure of these traces affords new workflows for composing quantitative and qualitative methods, building toward a unified platform for optimizing digital experiences.
Tasks are central to information retrieval (IR) and drive interactions with search systems [2, 4, 10]. Understanding and modeling tasks helps these systems better support user needs [8, 9, 11]. This keynote focuses on search tasks, the emergence of generative artificial intelligence (AI), and the implications of recent work at their intersection for the future of search. Recent estimates suggest that half of Web search queries go unanswered, many of them connected to complex search tasks that are ill-defined or multi-step and span several queries. AI copilots, e.g., ChatGPT and Bing Chat, are emerging to address complex search tasks and many other challenges. These copilots are built on large foundation models such as GPT-4 and are being extended with skills and plugins. Copilots broaden the surface of tasks achievable via search, moving toward creation not just finding (e.g., interview preparation, email composition), and can make searchers more efficient and more successful.
Users currently engage with AI copilots via natural language queries and dialog and the copilots generate answers with source attribution . However, in delegating responsibility for answer generation, searchers also lose some control over aspects of the search process, such as directly manipulating queries and examining lists of search results . The efficiency gains from auto-generating a single, synthesized answer may also reduce opportunities for user learning and serendipity. A wholesale move to copilots for all search tasks is neither practical nor necessary: model inference is expensive, conversational interfaces are unfamiliar to many users in a search context, and traditional search already excels for many types of task. Instead, experiences that unite search and chat are becoming more common, enabling users to adjust the modality and other aspects (e.g., answer tone) based on the task.
The rise of AI copilots creates many opportunities for IR, including aligning generated answers with user intent, tasks, and applications via human feedback ; understanding copilot usage, including functional fixedness ; using context and data to tailor responses to people and situations (e.g., grounding, personalization); new search experiences (e.g., unifying search and chat); reliability and safety (e.g., accuracy, bias); understanding impacts on user learning and agency; and evaluation (e.g., model-based feedback, searcher simulations  repeatability). Research in these and related areas will enable search systems to more effectively utilize new copilot technologies together with traditional search to help searchers better tackle a wider variety of tasks.
SESSION: Session 1 – Unbiased and Fairness
The graph neural network-based collaborative filtering (CF) models user-item interactions as a bipartite graph and performs iterative aggregation to enhance performance. Unfortunately, the aggregation process may amplify the popularity bias, which impedes user engagement with niche (unpopular) items. While some efforts have studied the popularity bias in CF, they often focus on modifying loss functions, which can not fully address the popularity bias in GNN-based CF models. This is because the debiasing loss can be falsely backpropagated to non-target nodes during the backward pass of the aggregation.
In this work, we study whether we can fundamentally neutralize the popularity bias in the aggregation process of GNN-based CF models. This is challenging because 1) estimating the effect of popularity is difficult due to the varied popularity caused by the aggregation from high-order neighbors, and 2) it is hard to train learnable popularity debiasing aggregation functions because of data sparsity. To this end, we theoretically analyze the cause of popularity bias and propose a quantitative metric, named inverse popularity score, to measure the effect of popularity in the representation space. Based on it, a novel graph aggregator named APDA is proposed to learn per-edge weight to neutralize popularity bias in aggregation. We further strengthen the debiasing effect with a weight scaling mechanism and residual connections. We apply APDA to two backbones and conduct extensive experiments on three real-world datasets. The results show that APDA significantly outperforms the state-of-the-art baselines in terms of recommendation performance and popularity debiasing.
User interaction data is an important source of supervision in counterfactual learning to rank (CLTR). Such data suffers from presentation bias. Much work in unbiased learning to rank (ULTR) focuses on position bias, i.e., items at higher ranks are more likely to be examined and clicked. Inter-item dependencies also influence examination probabilities, with outlier items in a ranking as an important example. They are defined as items that observably deviate from the rest and therefore stand out in the ranking. In this paper, we identify and introduce the bias brought about by outlier items: users tend to click more on outlier items and their close neighbors.
To this end, we first conduct a controlled experiment to study the effect of outliers on user clicks. Next, to examine whether the findings of our study generalize to naturalistic situations, we explore real-world click logs from an e-commerce platform. We show that, in both scenarios, users tend to click significantly more on outlier items compared to non-outlier items in the same rankings. We show that this tendency holds for all positions, i.e., for any specific position, an item receives more interactions when presented as an outlier as opposed to a non-outlier item. We conclude from our analysis that the outliers’ effect on clicks is a type of bias that should be addressed in ULTR. We therefore propose an outlier-aware click model that accounts for both outlier and position bias, called outlier-aware position-based model (OPBM). We estimate click propensities based on OPBM; through extensive experiments performed on both real-world e-commerce data and semi-synthetic data, we verify the effectiveness of our outlier-aware click model. Our results show the superiority of OPBM against baselines in terms of ranking performance when outlier bias is severe.
The issue of fairness in recommendation systems has recently become a matter of growing concern for both the academic and industrial sectors due to the potential for bias in machine learning models. One such bias is that of feedback loops, where the collection of data from an unfair online system hinders the accurate evaluation of the relevance scores between users and items. Given that recommendation systems often recommend popular content and vendors, the underlying relevance scores between users and items may not be accurately represented in the training data. Hence, this creates a feedback loop in which the user is not longer recommended based on their true relevance score but instead based on biased training data. To address this problem of feedback loops, we propose a two-stage representation learning framework, B-FAIR, aimed at rectifying the unfairness caused by biased historical data in recommendation systems. The framework disentangles the context data into sensitive and non-sensitive components using a variational autoencoder and then applies a novel Balanced Fairness Objective (BFO) to remove bias in the observational data when training a recommendation model. The efficacy of B-FAIR is demonstrated through experiments on both synthetic and real-world benchmarks, showing improved performance over state-of-the-art algorithms.
Most of the existing personalized recommendation methods predict the probability that one user might interact with the next item by matching their representations in the latent space. However, as a cognitive task, it is essential for an impressive recommender system to acquire the cognitive capacity rather than to decide the users’ next steps by learning the pattern from the historical interactions through matching-based objectives. Therefore, in this paper, we propose to model the recommendation as a logical reasoning task which is more in line with an intelligent recommender system. Different from the prior works, we embed each query as a box rather than a single point in the vector space, which is able to model sets of users or items enclosed and logical operators (e.g., intersection) over boxes in a more natural manner. Although modeling the logical query with box embedding significantly improves the previous work of reasoning-based recommendation, there still exist two intractable issues including aggregation of box embeddings and training stalemate in critical point of boxes. To tackle these two limitations, we propose a Contrastive Box learning framework for Collaborative Reasoning (CBox4CR). Specifically, CBox4CR combines a smoothed box volume-based contrastive learning objective with the logical reasoning objective to learn the distinctive box representations for the user’s preference and the logical query based on the historical interaction sequence. Extensive experiments conducted on four publicly available datasets demonstrate the superiority of our CBox4CR over the state-of-the-art models in recommendation task.
Many re-ranking strategies in search systems rely on stochastic ranking policies, encoded as Doubly-Stochastic (DS) matrices, that satisfy desired ranking constraints in expectation, e.g., Fairness of Exposure (FOE). These strategies are generally two-stage pipelines: (i) an offline re-ranking policy construction step and (ii) an online sampling of rankings step. Building a re-ranking policy requires repeatedly solving a constrained optimization problem, one for each issued query. Thus, it is necessary to recompute the optimization procedure for any new/unseen query. Regarding sampling, the Birkhoff-von-Neumann decomposition (BvND) is the favored approach to draw rankings from any DS-based policy. Nonetheless, the BvND is too costly to compute online. Hence, the BvND as a sampling solution is memory-consuming as it can grow as O(N n2) for N queries and n documents.
This paper proposes a novel, fast, lightweight way to predict fair stochastic re-ranking policies: Constrained Meta-Optimal Transport (CoMOT). This method fits a neural network shared across queries like a learning-to-rank system. We also introduce Gumbel-Matching Sampling (GumMS), an online sampling approach from DS-based policies. Our proposed pipeline, CoMOT + GumMS, only needs to store the parameters of a single model, and it can generalize to unseen queries. We empirically evaluated our pipeline on the TREC 2019 and 2020 datasets under FOE constraints. Our experiments show that CoMOT rapidly predicts fair re-ranking policies on held-out data, with a speed-up proportional to the average number of documents per query. It also displays fairness and ranking performance similar to the original optimization-based policy. Furthermore, we empirically validate the effectiveness of GumMS to approximate DS-based policies in expectation. Together, our methods are an important step in learning-to-predict solutions to optimization problems in information retrieval.
SESSION: Session 2 – Sequential Recommendation – Part 1
Sequential recommendation aims to capture users’ dynamic interest and predicts the next item of users’ preference. Most sequential recommendation methods use a deep neural network as sequence encoder to generate user and item representations. Existing works mainly center upon designing a stronger sequence encoder. However, few attempts have been made with training an ensemble of networks as sequence encoders, which is more powerful than a single network because an ensemble of parallel networks can yield diverse prediction results and hence better accuracy. In this paper, we present Ensemble Modeling with contrastive Knowledge Distillation for sequential recommendation (EMKD). Our framework adopts multiple parallel networks as an ensemble of sequence encoders and recommends items based on the output distributions of all these networks. To facilitate knowledge transfer between parallel networks, we propose a novel contrastive knowledge distillation approach, which performs knowledge transfer from the representation level via Intra-network Contrastive Learning (ICL) and Cross-network Contrastive Learning (CCL), as well as Knowledge Distillation (KD) from the logits level via minimizing the Kullback-Leibler divergence between the output distributions of the teacher network and the student network. To leverage contextual information, we train the primary masked item prediction task alongside the auxiliary attribute prediction task as a multi-task learning scheme. Extensive experiments on public benchmark datasets show that EMKD achieves a significant improvement compared with the state-of-the-art methods. Besides, we demonstrate that our ensemble method is a generalized approach that can also improve the performance of other sequential recommenders. Our code is available at this link: https://github.com/hw-du/EMKD.
The long-tailed problem is a long-standing challenge in Sequential Recommender Systems (SRS) in which the problem exists in terms of both users and items. While many existing studies address the long-tailed problem in SRS, they only focus on either the user or item perspective. However, we discover that the long-tailed user and item problems exist at the same time, and considering only either one of them leads to sub-optimal performance of the other one. In this paper, we propose a novel framework for SRS, called Mutual Enhancement of Long-Tailed user and item (MELT), that jointly alleviates the long-tailed problem in the perspectives of both users and items. MELT consists of bilateral branches each of which is responsible for long-tailed users and items, respectively, and the branches are trained to mutually enhance each other, which is trained effectively by a curriculum learning-based training. MELT is model-agnostic in that it can be seamlessly integrated with existing SRS models. Extensive experiments on eight datasets demonstrate the benefit of alleviating the long-tailed problems in terms of both users and items even without sacrificing the performance of head users and items, which has not been achieved by existing methods. To the best of our knowledge, MELT is the first work that jointly alleviates the long-tailed user and item problems in SRS.
The self-attention mechanism, which equips with a strong capability of modeling long-range dependencies, is one of the extensively used techniques in the sequential recommendation field. However, many recent studies represent that current self-attention based models are low-pass filters and are inadequate to capture high-frequency information. Furthermore, since the items in the user behaviors are intertwined with each other, these models are incomplete to distinguish the inherent periodicity obscured in the time domain. In this work, we shift the perspective to the frequency domain, and propose a novel Frequency Enhanced Hybrid Attention Network for Sequential Recommendation, namely FEARec. In this model, we firstly improve the original time domain self-attention in the frequency domain with a ramp structure to make both low-frequency and high-frequency information could be explicitly learned in our approach. Moreover, we additionally design a similar attention mechanism via auto-correlation in the frequency domain to capture the periodic characteristics and fuse the time and frequency level attention in a union model. Finally, both contrastive learning and frequency regularization are utilized to ensure that multiple views are aligned in both the time domain and frequency domain. Extensive experiments conducted on four widely used benchmark datasets demonstrate that the proposed model performs significantly better than the state-of-the-art approaches.
Contrastive Learning (CL) performances as a rising approach to address the challenge of sparse and noisy recommendation data. Although having achieved promising results, most existing CL methods only perform either hand-crafted data or model augmentation for generating contrastive pairs to find a proper augmentation operation for different datasets, which makes the model hard to generalize. Additionally, since insufficient input data may lead the encoder to learn collapsed embeddings, these CL methods expect a relatively large number of training data (e.g., large batch size or memory bank) to contrast. However, not all contrastive pairs are always informative and discriminative enough for the training processing. Therefore, a more general CL-based recommendation model called Meta-optimized Contrastive Learning for sequential Recommendation (MCLRec) is proposed in this work. By applying both data augmentation and learnable model augmentation operations, this work innovates the standard CL framework by contrasting data and model augmented views for adaptively capturing the informative features hidden in stochastic data augmentation. Moreover, MCLRec utilizes a meta-learning manner to guide the updating of the model augmenters, which helps to improve the quality of contrastive pairs without enlarging the amount of input data. Finally, a contrastive regularization term is considered to encourage the augmentation model to generate more informative augmented views and avoid too similar contrastive pairs within the meta updating. The experimental results on commonly used datasets validate the effectiveness of MCLRec.
Sequential recommender systems (SRSs) aim to predict the subsequent items which may interest users via comprehensively modeling users’ complex preference embedded in the sequence of user-item interactions. However, most of existing SRSs often model users’ single low-level preference based on item ID information while ignoring the high-level preference revealed by item attribute information, such as item category. Furthermore, they often utilize limited sequence context information to predict the next item while overlooking richer inter-item semantic relations. To this end, in this paper, we proposed a novel hierarchical preference modeling framework to substantially model the complex low- and high-level preference dynamics for accurate sequential recommendation. Specifically, in the framework, a novel dual-transformer module and a novel dual contrastive learning scheme have been designed to discriminatively learn users’ low- and high-level preference and to effectively enhance both low- and high-level preference learning respectively. In addition, a novel semantics-enhanced context embedding module has been devised to generate more informative context embedding for further improving the recommendation performance. Extensive experiments on six real-world datasets have demonstrated both the superiority of our proposed method over the state-of-the-art ones and the rationality of our design.
SESSION: Session 3 – Dense Retrieval
A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering
Knowledge-Intensive Visual Question Answering (KI-VQA) refers to answering a question about an image whose answer does not lie in the image. This paper presents a new pipeline for KI-VQA tasks, consisting of a retriever and a reader. First, we introduce DEDR, a symmetric dual encoding dense retrieval framework in which documents and queries are encoded into a shared embedding space using uni-modal (textual) and multi-modal encoders. We introduce an iterative knowledge distillation approach that bridges the gap between the representation spaces in these two encoders. Extensive evaluation on two well-established KI-VQA datasets, i.e., OK-VQA and FVQA, suggests that DEDR outperforms state-of-the-art baselines by 11.6% and 30.9% on OK-VQA and FVQA, respectively.
Utilizing the passages retrieved by DEDR, we further introduce MM-FiD, an encoder-decoder multi-modal fusion-in-decoder model, for generating a textual answer for KI-VQA tasks. MM-FiD encodes the question, the image, and each retrieved passage separately and uses all passages jointly in its decoder. Compared to competitive baselines in the literature, this approach leads to 5.5% and 8.5% improvements in terms of question answering accuracy on OK-VQA and FVQA, respectively.
Developing a universal model that can efficiently and effectively respond to a wide range of information access requests-from retrieval to recommendation to question answering—has been a long-lasting goal in the information retrieval community. This paper argues that the flexibility, efficiency, and effectiveness brought by the recent development in dense retrieval and approximate nearest neighbor search have smoothed the path towards achieving this goal. We develop a generic and extensible dense retrieval framework, called framework, that can handle a wide range of (personalized) information access requests, such as keyword search, query by example, and complementary item recommendation. Our proposed approach extends the capabilities of dense retrieval models for ad-hoc retrieval tasks by incorporating user-specific preferences through the development of a personalized attentive network. This allows for a more tailored and accurate personalized information access experience. Our experiments on real-world e-commerce data suggest the feasibility of developing universal information access models by demonstrating significant improvements even compared to competitive baselines specifically developed for each of these individual information access tasks. This work opens up a number of fundamental research directions for future exploration.
Recent studies have shown that Dense Retrieval (DR) techniques can significantly improve the performance of first-stage retrieval in IR systems. Despite its empirical effectiveness, the application of DR is still limited. In contrast to statistic retrieval models that rely on highly efficient inverted index solutions, DR models build dense embeddings that are difficult to be pre-processed with most existing search indexing systems. To avoid the expensive cost of brute-force search, the Approximate Nearest Neighbor (ANN) algorithm and corresponding indexes are widely applied to speed up the inference process of DR models. Unfortunately, while ANN can improve the efficiency of DR models, it usually comes with a significant price on retrieval performance.
To solve this issue, we propose JTR, which stands for Joint optimization of TRee-based index and query encoding. Specifically, we design a new unified contrastive learning loss to train tree-based index and query encoder in an end-to-end manner. The tree-based negative sampling strategy is applied to make the tree have the maximum heap property, which supports the effectiveness of beam search well. Moreover, we treat the cluster assignment as an optimization problem to update the tree-based index that allows overlapped clustering. We evaluate JTR on numerous popular retrieval benchmarks. Experimental results show that JTR achieves better retrieval performance while retaining high system efficiency compared with widely-adopted baselines. It provides a potential solution to balance efficiency and effectiveness in neural retrieval system designs.
Neural retrievers have been shown to be effective for math-aware search. Their ability to cope with math symbol mismatches, to represent highly contextualized semantics, and to learn effective representations are critical to improving math information retrieval. However, the most effective retriever for math remains impractical as it depends on token-level dense representations for each math token, which leads to prohibitive storage demands, especially considering that math content generally consumes more tokens. In this work, we try to alleviate this efficiency bottleneck while boosting math information retrieval effectiveness via hybrid search. To this end, we propose MABOWDOR, a Math-Aware Bestof-Worlds Domain Optimized Retriever, which has an unsupervised structure search component, a dense retriever, and optionally a sparse retriever on top of a domain-adapted backbone learned by context-enhanced pretraining, each addressing a different need in retrieving heterogeneous data from math documents. Our hybrid search outperforms the previous state-of-the-art math IR system while eliminating efficiency bottlenecks. Our system is available at https://github.com/approach0/pya0.
Retrieval approaches that score documents based on learned dense vectors (i.e., dense retrieval) rather than lexical signals (i.e., conventional retrieval) are increasingly popular. Their ability to identify related documents that do not necessarily contain the same terms as those appearing in the user’s query (thereby improving recall) is one of their key advantages. However, to actually achieve these gains, dense retrieval approaches typically require an exhaustive search over the document collection, making them considerably more expensive at query-time than conventional lexical approaches. Several techniques aim to reduce this computational overhead by approximating the results of a full dense retriever. Although these approaches reasonably approximate the top results, they suffer in terms of recall — one of the key advantages of dense retrieval. We introduce ‘LADR’ (Lexically-Accelerated Dense Retrieval), a simple-yet-effective approach that improves the efficiency of existing dense retrieval models without compromising on retrieval effectiveness. LADR uses lexical retrieval techniques to seed a dense retrieval exploration that uses a document proximity graph. Through extensive experiments, we find that LADR establishes a new dense retrieval effectiveness-efficiency Pareto frontier among approximate k nearest neighbor techniques. When tuned to take around 8ms per query in retrieval latency on our hardware, LADR consistently achieves both precision and recall that are on par with an exhaustive search on standard benchmarks. Importantly, LADR accomplishes this using only a single CPU — no hardware accelerators such as GPUs — which reduces the deployment cost of dense retrieval systems.
Dense retrieval models use bi-encoder network architectures for learning query and document representations. These representations are often in the form of a vector representation and their similarities are often computed using the dot product function. In this paper, we propose a new representation learning framework for dense retrieval. Instead of learning a vector for each query and document, our framework learns a multivariate distribution and uses negative multivariate KL divergence to compute the similarity between distributions. For simplicity and efficiency reasons, we assume that the distributions are multivariate normals and then train large language models to produce mean and variance vectors for these distributions. We provide a theoretical foundation for the proposed framework and show that it can be seamlessly integrated into the existing approximate nearest neighbor algorithms to perform retrieval efficiently. We conduct an extensive suite of experiments on a wide range of datasets, and demonstrate significant improvements compared to competitive dense retrieval models.
SESSION: Session 4 – Language Models
Large Language Models are Versatile Decomposers: Decomposing Evidence and Questions for Table-based Reasoning
Table-based reasoning has shown remarkable progress in a wide range of table-based tasks. It is a challenging task, which requires reasoning over both free-form natural language (NL) questions and (semi-)structured tabular data. However, previous table-based reasoning solutions usually suffer from significant performance degradation on ”huge” evidence (tables). In addition, most existing methods struggle to reason over complex questions since the essential information is scattered in different places. To alleviate the above challenges, we exploit large language models (LLMs) as decomposers for effective table-based reasoning, which (i) decompose huge evidence (a huge table) into sub-evidence (a small table) to mitigate the interference of useless information for table reasoning, and (ii) decompose a complex question into simpler sub-questions for text reasoning. First, we use a powerful LLM to decompose the evidence involved in the current question into the sub-evidence that retains the relevant information and excludes the remaining irrelevant information from the ”huge” evidence. Second, we propose a novel ”parsing-execution-filling” strategy to decompose a complex question into simper step-by-step sub-questions by generating intermediate SQL queries as a bridge to produce numerical and logical sub-questions with a powerful LLM. Finally, we leverage the decomposed sub-evidence and sub-questions to get the final answer with a few in-context prompting examples. Extensive experiments on three benchmark datasets (TabFact, WikiTableQuestion, and FetaQA) demonstrate that our method achieves significantly better results than competitive baselines for table-based reasoning. Notably, our method outperforms human performance for the first time on the TabFact dataset. In addition to impressive overall performance, our method also has the advantage of interpretability, where the returned results are to some extent tractable with the generated sub-evidence and sub-questions. For reproducibility, we release our source code and data at: https://github.com/AlibabaResearch/DAMO-ConvAI.
Query and point of interest (POI) matching is a core task in location-based services~(LBS), e.g., navigation maps. It connects users’ intent with real-world geographic information. Lately, pre-trained language models (PLMs) have made notable advancements in many natural language processing (NLP) tasks. To overcome the limitation that generic PLMs lack geographic knowledge for query-POI matching, related literature attempts to employ continued pre-training based on domain-specific corpus. However, a query generally describes the geographic context (GC) about its destination and contains mentions of multiple geographic objects like nearby roads and regions of interest (ROIs). These diverse geographic objects and their correlations are pivotal to retrieving the most relevant POI. Text-based single-modal PLMs can barely make use of the important GC and are therefore limited. In this work, we propose a novel method for query-POI matching, namely Multi-modal Geographic language model (MGeo), which comprises a geographic encoder and a multi-modal interaction module. Representing GC as a new modality, MGeo is able to fully extract multi-modal correlations to perform accurate query-POI matching. Moreover, there exists no publicly available query-POI matching benchmark. Intending to facilitate further research, we build a new open-source large-scale benchmark for this topic, i.e., Geographic TExtual Similarity (GeoTES). The POIs come from an open-source geographic information system (GIS) and the queries are manually generated by annotators to prevent privacy issues. Compared with several strong baselines, the extensive experiment results and detailed ablation analyses demonstrate that our proposed multi-modal geographic pre-training method can significantly improve the query-POI matching capability of PLMs with or without users’ locations. Our code and benchmark are publicly available at https://github.com/PhantomGrapes/MGeo.
Multimodal sentence summarization, aiming to generate a brief summary of the source sentence and image, is a new yet challenging task. Although existing methods have achieved compelling success, they still suffer from two key limitations: 1) lacking the adaptation of generative pre-trained language models for open-domain MMSS, and 2) lacking the explicit critical information modeling. To address these limitations, we propose a BART-MMSS framework, where BART is adopted as the backbone. To be specific, we propose a prompt-guided image encoding module to extract the source image feature. It leverages several soft to-be-learned prompts for image patch embedding, which facilitates the visual content injection to BART for open-domain MMSS tasks. Thereafter, we devise an explicit source critical token learning module to directly capture the critical tokens of the source sentence with the reference of the source image, where we incorporate explicit supervision to improve performance. Extensive experiments on a public dataset fully validate the superiority of our proposed method. In addition, the predicted tokens by the vision-guided key-token highlighting module can be easily understood by humans and hence improve the interpretability of our model.
Systematic review is a crucial method that has been widely used. by scholars from different research domains. However, screening for relevant scientific literature from paper candidates remains an extremely time-consuming process so the task of screening prioritization has been established to reduce the human workload. Various methods under the human-in-the-loop fashion are proposed to solve this task by using lexical features. These methods, even though achieving better performance than more sophisticated feature-based models such as BERT, omit rich and essential semantic information, therefore suffered from feature bias. In this study, we propose a novel framework SciMine to accelerate this screening process by capturing semantic feature representations from both background and the corpus. In particular, based on contextual representation learned from the pre-trained language models, our approach utilizes an autoencoder-based classifier and a feature-dependent classification module to extract general document-level and phrase-level information. Then a ranking ensemble strategy is used to combine these two complementary pieces of information. Experiments on five real-world datasets demonstrate that SciMine achieves state-of-the-art performance and comprehensive analysis further shows the efficacy of SciMine to solve feature bias.
Learning vectors that capture the meaning of concepts remains a fundamental challenge. Somewhat surprisingly, perhaps, pre-trained language models have thus far only enabled modest improvements to the quality of such concept embeddings. Current strategies for using language models typically represent a concept by averaging the contextualised representations of its mentions in some corpus. This is potentially sub-optimal for at least two reasons. First, contextualised word vectors have an unusual geometry, which hampers downstream tasks. Second, concept embeddings should capture the semantic properties of concepts, whereas contextualised word vectors are also affected by other factors. To address these issues, we propose two contrastive learning strategies, based on the view that whenever two sentences reveal similar properties, the corresponding contextualised vectors should also be similar. One strategy is fully unsupervised, estimating the properties which are expressed in a sentence from the neighbourhood structure of the contextualised word embeddings. The second strategy instead relies on a distant supervision signal from ConceptNet. Our experimental results show that the resulting vectors substantially outperform existing concept embeddings in predicting the semantic properties of concepts, with the ConceptNet-based strategy achieving the best results. These findings are furthermore confirmed in a clustering task and in the downstream task of ontology completion.
Some recent news recommendation (NR) methods introduce a Pre-trained Language Model (PLM) to encode news representation by following the vanilla pre-train and fine-tune paradigm with carefully-designed recommendation-specific neural networks and objective functions. Due to the inconsistent task objective with that of PLM, we argue that their modeling paradigm has not well exploited the abundant semantic information and linguistic knowledge embedded in the pre-training process. Recently, the pre-train, prompt, and predict paradigm, called prompt learning, has achieved many successes in natural language processing domain. In this paper, we make the first trial of this new paradigm to develop a Prompt Learning for News Recommendation (Prompt4NR) framework, which transforms the task of predicting whether a user would click a candidate news as a cloze-style mask-prediction task. Specifically, we design a series of prompt templates, including discrete, continuous, and hybrid templates, and construct their corresponding answer spaces to examine the proposed Prompt4NR framework. Furthermore, we use the prompt ensembling to integrate predictions from multiple prompt templates. Extensive experiments on the MIND dataset validate the effectiveness of our Prompt4NR with a set of new benchmark results.
SESSION: Session 5 – Exposure and Popularity Bias
Offline reinforcement learning (RL), a technology that offline learns a policy from logged data without the need to interact with online environments, has become a favorable choice in decision-making processes like interactive recommendation. Offline RL faces the value overestimation problem. To address it, existing methods employ conservatism, e.g., by constraining the learned policy to be close to behavior policies or punishing the rarely visited state-action pairs. However, when applying such offline RL to recommendation, it will cause a severe Matthew effect, i.e., the rich get richer and the poor get poorer, by promoting popular items or categories while suppressing the less popular ones. It is a notorious issue that needs to be addressed in practical recommender systems.
In this paper, we aim to alleviate the Matthew effect in offline RL-based recommendation. Through theoretical analyses, we find that the conservatism of existing methods fails in pursuing users’ long-term satisfaction. It inspires us to add a penalty term to relax the pessimism on states with high entropy of the logging policy and indirectly penalizes actions leading to less diverse states. This leads to the main technical contribution of the work: Debiased model-based Offline RL (DORL) method. Experiments show that DORL not only captures user interests well but also alleviates the Matthew effect. The implementation is available via https://github.com/chongminggao/DORL-codes
Counterfactual learning to rank (CLTR) relies on exposure-based inverse propensity scoring (IPS), a LTR-specific adaptation of IPS to correct for position bias. While IPS can provide unbiased and consistent estimates, it often suffers from high variance. Especially when little click data is available, this variance can cause CLTR to learn sub-optimal ranking behavior. Consequently, existing CLTR methods bring significant risks with them, as naively deploying their models can result in very negative user experiences.
We introduce a novel risk-aware CLTR method with theoretical guarantees for safe deployment. We apply a novel exposure-based concept of risk regularization to IPS estimation for LTR. Our risk regularization penalizes the mismatch between the ranking behavior of a learned model and a given safe model. Thereby, it ensures that learned ranking models stay close to a trusted model, when there is high uncertainty in IPS estimation, which greatly reduces the risks during deployment. Our experimental results demonstrate the efficacy of our proposed method, which is effective at avoiding initial periods of bad performance when little date is available, while also maintaining high performance at convergence. For the CLTR field, our novel exposure-based risk minimization method enables practitioners to adopt CLTR methods in a safer manner that mitigates many of the risks attached to previous methods.
Personalized news recommendation aims to recommend candidate news to the target user, according to the clicked news history. The user-news interaction data exhibits power-law distribution, however, existing works usually learn representations in Euclidean space which makes inconsistent capacities between data space and embedding space, leading to severe representation distortion problem. Besides, the existence of conformity bias, a potential cause of power-law distribution, may introduce biased guidance to learn user representations. In this paper, we propose a novel debiased method based on hyperbolic space, named HDNR, to tackle the above problems. Specifically, first, we employ hyperboloid model with exponential growth capacity to conduct user and news modeling, in order to solve inconsistent space capacities problem and obtain low distortion representations. Second, we design a re-weighting aggregation module to further mitigate conformity bias in data distribution, through considering local importance of the clicked news among contextual history and its global popularity degree simultaneously. Finally, we calculate the relevance score between target user and candidate news representations. We conduct experiments on two real-world news recommendation datasets MIND-Large, MIND-Small and empirical results demonstrate the effectiveness of our approach from multiple perspectives.
In the era of information explosion, numerous items emerge every day, especially in feed scenarios. Due to the limited system display slots and user browsing attention, various recommendation systems are designed not only to satisfy users’ personalized information needs but also to allocate items’ exposure. However, recent recommendation studies mainly focus on modeling user preferences to present satisfying results and maximize user interactions, while paying little attention to developing item-side fair exposure mechanisms for rational information delivery. This may lead to serious resource allocation problems on the item side, such as the Snowball Effect. Furthermore, unfair exposure mechanisms may hurt recommendation performance. In this paper, we call for a shift of attention from modeling user preferences to developing fair exposure mechanisms for items. We first conduct empirical analyses of feed scenarios to explore exposure problems between items with distinct uploaded times. This points out that unfair exposure caused by the time factor may be the major cause of the Snowball Effect. Then, we propose to explicitly model item-level customized timeliness distribution, Global Residual Value (GRV), for fair resource allocation. This GRV module is introduced into recommendations with the designed Timeliness-aware Fair Recommendation Framework (TaFR). Extensive experiments on two datasets demonstrate that TaFR achieves consistent improvements with various backbone recommendation models. By modeling item-side customized Global Residual Value, we achieve a fairer distribution of resources and, at the same time, improve recommendation performance.
SESSION: Session 6 – Sequential Recommendation – Part 2
Modeling user sequential behaviors have been demonstrated to be effective in promoting the recommendation performance. While previous work has achieved remarkable successes, they mostly assume that the training and testing distributions are consistent, which may contradict with the diverse and complex user preferences, and limit the recommendation performance in real-world scenarios. To alleviate this problem, in this paper, we propose a robust sequential recommender framework to overcome the potential distribution shift between the training and testing sets. In specific, we firstly simulate different training distributions via sample reweighting. Then, we minimize the largest loss induced by these distributions to optimize the ‘worst-case’ loss for improving the model robustness. Considering that there can be too many sample weights, which may introduce too much flexibility and be hard to optimize, we cluster the training samples based on both hard and soft strategies, and assign each cluster with a unified weight. At last, we analyze our framework by presenting the generalization error bound of the above minimax objective, which help us to better understand the proposed framework from the theoretical perspective. We conduct extensive experiments based on three real-world datasets to demonstrate the effectiveness of our proposed framework. To reproduce our experiments and promote this research direction, we have released our project at https://anonymousrsr.github.io/RSR/.
Transformer models have achieved remarkable success in sequential recommender systems (SRSs). However, computing the attention matrix in traditional dot-product attention mechanisms results in a quadratic complexity with sequence lengths, leading to high computational costs for long-term sequential recommendation. Motivated by the above observation, we propose a novel L2-Normalized Linear Attention for the Transformer-based Sequential Recommender Systems (LinRec), which theoretically improves efficiency while preserving the learning capabilities of the traditional dot-product attention. Specifically, by thoroughly examining the equivalence conditions of efficient attention mechanisms, we show that LinRec possesses linear complexity while preserving the property of attention mechanisms. In addition, we reveal its latent efficiency properties by interpreting the proposed LinRec mechanism through a statistical lens. Extensive experiments are conducted based on two public benchmark datasets, demonstrating that the combination of LinRec and Transformer models achieves comparable or even superior performance than state-of-the-art Transformer-based SRS models while significantly improving time and memory efficiency. The implementation code is available online at https://github.com/Applied-Machine-Learning-Lab/LinRec.>
Self-supervised learning (SSL) has been recently applied to sequential recommender systems to provide high-quality user representations. However, while facilitating the learning process recommender systems, SSL is not without security threats: carefully crafted inputs can poison the pre-trained models driven by SSL, thus reducing the effectiveness of the downstream recommendation model. This work shows that poisoning attacks against the pre-training stage threaten sequential recommender systems. Without any background knowledge of the model architecture and parameters, nor any API queries, our strategy proves the feasibility of poisoning attacks on mainstream SSL-based recommender schemes as well as on commonly used datasets. By injecting only a tiny amount of fake users, we get the target item recommended to real users more than thousands of times as before, demonstrating that recommender systems have a new attack surface due to SSL. We further show our attack is challenging for recommendation platforms to detect and defend. Our work highlights the weakness of self-supervised recommender systems and shows the necessity for researchers to be aware of this security threat. Our source code is available at https://github.com/CongGroup/Poisoning-SSL-based-RS.
The pre-training paradigm, i.e., learning universal knowledge across a wide spectrum of domains, has increasingly become a new de-facto practice in many fields, especially for transferring to new domains. The recent progress includes universal pre-training solutions for recommendation. However, we argue that the common treatment utilizing the masked language modeling or simple data augmentation via contrastive learning is not sufficient for pre-training a recommender system, since a user’s intent could be more complex than predicting the next word or item. It is more intuitive to go a step further by devising the multi-interest driven pre-training framework for universal user understanding. Nevertheless, incorporating multi-interest modeling in recommender system pre-training is non-trivial due to the dynamic, contextual, and temporary nature of the user interests, particularly when the users are from different domains. The limited effort on this line has greatly rendered it as an open question.
In this paper, we propose a novel Multi-Interest Pre-training with Sparse Capsule framework (named Miracle). Miracle performs a universal multi-interest modeling with a sparse capsule network and an interest-aware pre-training task. Specifically, we utilize a text-aware item embedding module, including an MoE adaptor and a deeply-contextual encoding component, to model contextual and transferable item representations. Then, we propose a sparse interest activation mechanism coupled with a position-aware capsule network for adaptive interest extraction. Furthermore, an interest-level contrastive pre-training task is introduced to guide the sparse capsule network to learn universal interests precisely. We conduct extensive experiments on eleven real-world datasets and eight baselines. The results show that our method significantly outperforms a series of SOTA on these benchmark datasets. The code is available at https://github.com/WHUIR/Miracle.
While some powerful neural network architectures (e.g., Transformer, Graph Neural Networks) have achieved improved performance in sequential recommendation with high-order item dependency modeling, they may suffer from poor representation capability in label scarcity scenarios. To address the issue of insufficient labels, Contrastive Learning (CL) has attracted much attention in recent methods to perform data augmentation through embedding contrasting for self-supervision. However, due to the hand-crafted property of their contrastive view generation strategies, existing CL-enhanced models i) can hardly yield consistent performance on diverse sequential recommendation tasks; ii) may not be immune to user behavior data noise. In light of this, we propose a simple yet effective Graph Masked AutoEncoder-enhanced sequential Recommender system (MAERec) that adaptively and dynamically distills global item transitional information for self-supervised augmentation. It naturally avoids the above issue of heavy reliance on constructing high-quality embedding contrastive views. Instead, an adaptive data reconstruction paradigm is designed to be integrated with the long-range item dependency modeling, for informative augmentation in sequential recommendation. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baseline models and can learn more accurate representations against data noise and sparsity. Our implemented model code is available at https://github.com/HKUDS/MAERec.
Leading sequential recommendation (SeqRec) models adopt empirical risk minimization (ERM) as the learning framework, which inherently assumes that the training data (historical interaction sequences) and the testing data (future interactions) are drawn from the same distribution. However, such i.i.d. assumption hardly holds in practice, due to the online serving and dynamic nature of recommender system.For example, with the streaming of new data, the item popularity distribution would change, and the user preference would evolve after consuming some items. Such distribution shifts could undermine the ERM framework, hurting the model’s generalization ability for future online serving.
In this work, we aim to develop a generic learning framework to enhance the generalization of recommenders in the dynamic environment. Specifically, on top of ERM, we devise a Distributionally Robust Optimization mechanism for SeqRec (DROS). At its core is our carefully-designed distribution adaption paradigm, which considers the dynamics of data distribution and explores possible distribution shifts between training and testing. Through this way, we can endow the backbone recommenders with better generalization ability.It is worth mentioning that DROS is an effective model-agnostic learning framework, which is applicable to general recommendation scenarios.Theoretical analyses show that DROS enables the backbone recommenders to achieve robust performance in future testing data.Empirical studies verify the effectiveness against dynamic distribution shifts of DROS. Codes are anonymously open-sourced at https://github.com/YangZhengyi98/DROS.
SESSION: Session 7 – Knowledge Graphs for Recommendation
Multi-task Recommender Systems (MTRSs) has become increasingly prevalent in a variety of real-world applications due to their exceptional training efficiency and recommendation quality. However, conventional MTRSs often input all relevant feature fields without distinguishing their contributions to different tasks, which can lead to confusion and a decline in performance. Existing feature selection methods may neglect task relations or require significant computation during model training in multi-task setting. To this end, this paper proposes a novel Single-shot Feature Selection framework for MTRSs, referred to as MultiSFS, which is capable of selecting feature fields for each task while considering task relations in a single-shot manner. Specifically, MultiSFS first efficiently obtains task-specific feature importance through a single forward-backward pass. Then, a data-task bipartite graph is constructed to learn field-level task relations. Subsequently, MultiSFS merges the feature importance according to task relations and selects feature fields for different tasks. To demonstrate the effectiveness and properties of MultiSFS, we integrate it with representative MTRS models and evaluate on three real-world datasets. The implementation code is available online to ease reproducibility.
Session-based recommendation (SBR) has received increasing attention to predict the next item via extracting and integrating both global and local item-item relationships. However, there still exist some deficiencies in current works when capturing these two kinds of relationships. For global item-item relationships, the global graph constructed by most SBR is a pseudo-global graph, which may cause redundant mining of sequence relationships. For local item-item relationships, conventional SBR only mines the sequence patterns while ignoring the feature patterns, which may introduce noise when learning users’ interests. To address these problems, we propose a novel Knowledge-enhanced Multi-View Graph Neural Network (KMVG) by constructing three views, namely knowledge view, session view, and pairwise view. Specifically, benefiting from the rich semantic information in the knowledge graph (KG), we build a genuine global graph that is sequence-independent based on KG to mine the global item-item relationships in the knowledge view. Then, a session view is utilized to capture the contextual transitions among items as the sequence patterns of local item-item relationships, and a pairwise view is used to explore the feature commonality within a session as the feature patterns of the local item-item relationships. Extensive experiments on three real-world public datasets demonstrate the superiority of KMVG, showing that it outperforms the state-of-the-art baselines. Further analysis also reveals the effectiveness of KMVG in exploiting the item-item relationships under multiple views.
Knowledge graph (KG), which contains rich side information, becomes an essential part to boost the recommendation performance and improve its explainability. However, existing knowledge-aware recommendation methods directly perform information propagation on KG and user-item bipartite graph, ignoring the impacts of task-irrelevant knowledge propagation and vulnerability to interaction noise, which limits their performance. To solve these issues, we propose a robust knowledge-aware recommendation framework, called Knowledge-refined Denoising Network (KRDN), to prune the task-irrelevant knowledge associations and noisy implicit feedback simultaneously. KRDN consists of an adaptive knowledge refining strategy and a contrastive denoising mechanism, which are able to automatically distill high-quality KG triplets for aggregation and prune noisy implicit feedback respectively. Besides, we also design the self-adapted loss function and the gradient estimator for model optimization. The experimental results on three benchmark datasets demonstrate the effectiveness and robustness of KRDN over the state-of-the-art knowledge-aware methods like KGIN, MCCLK, and KGCL, and also outperform robust recommendation models like SGL and SimGCL. The implementations are available at https://github.com/xj-zhu98/KRDN.
As auxiliary collaborative signals, the entity connectivity and relation semanticity beneath knowledge graph (KG) triples can alleviate the data sparsity and cold-start issues of recommendation tasks. Thus many works consider obtaining user and item representations via information aggregation on graph-structured data within Euclidean space. However, the scale-free graphs (e.g., KGs) inherently exhibit non-Euclidean geometric topologies, such as tree-like and circle-like structures. The existing recommendation models built in a single type of embedding space do not have enough capacity to embrace various geometric patterns, consequently, resulting in suboptimal performance. To address this limitation, we propose a KG-aware recommendation model with mixed-curvature manifolds interaction learning, namely CurvRec. On the one hand, it aims to preserve various global geometric structures in KG with mixed-curvature manifold spaces as the backbone. On the other hand, we integrate Ricci curvature into graph convolutional networks (GCNs) to capture local geometric structural properties when aggregating neighbor nodes. Besides, to exploit the expressive spatial features in KG, we incorporate interaction learning to ensure the geometric message passing between curved manifolds. Specifically, we adopt curvature-aware geodesic distance metrics to maximize the mutual information between Euclidean space and non-Euclidean spaces. Through extensive experiments, we demonstrate that the proposed CurvRec outperforms state-of-the-art baselines.
SESSION: Session 8 – POI Recommendation
EEDN: Enhanced Encoder-Decoder Network with Local and Global Context Learning for POI Recommendation
The point-of-interest (POI) recommendation predicts users’ destinations, which might be of interest to users and has attracted considerable attention as one of the major applications in location-based social networks (LBSNs). Recent work on graph-based neural networks (GNN) or matrix factorization-based (MF) approaches has resulted in better representations of users and POIs to forecast users’ latent preferences. However, they still suffer from the implicit feedback and cold-start problems of check-in data, as they cannot capture both local and global graph-based relations among users (or POIs) simultaneously, and the cold-start neighbors are not handled properly during graph convolution in GNN. In this paper, we propose an enhanced encoder-decoder network (EEDN) to exploit rich latent features between users, POIs, and interactions between users and POIs for POI recommendation. The encoder of EEDN utilizes a hybrid hypergraph convolution to enhance the aggregation ability of each graph convolution step and learns to derive more robust cold-start-aware user representations. In contrast, the decoder mines local and global interactions by both graph- and sequential-based patterns for modeling implicit feedback, especially to alleviate exposure bias. Extensive experiments in three public real-world datasets demonstrate that EEDN outperforms state-of-the-art methods. Our source codes and data are released at https://github.com/WangXFng/EEDN
Next Point-of-Interest (POI) recommendation is an essential part of the flourishing location-based applications, where the demands of users are not only conditioned by their recent check-in behaviors but also by the critical influence stemming from geographical dependencies among POIs. Existing methods leverage Graph Neural Networks with the aid of pre-defined POI graphs to capture such indispensable correlations for modeling user preferences, assuming that the appropriate geographical dependencies among POIs could be pre-determined. However, the pre-defined graph structures are always far from the optimal graph topology due to noise and adaptability issues, which may decrease the expressivity of learned POI representations as well as the credibility of modeling user preferences. In this paper, we propose a novel Adaptive Graph Representation-enhanced Attention Network (AGRAN) for next POI recommendation, which explores the utilization of graph structure learning to replace the pre-defined static graphs for learning more expressive representations of POIs. In particular, we develop an adaptive POI graph matrix and learn it via similarity learning with POI embeddings, automatically capturing the underlying geographical dependencies for representation learning. Afterward, we incorporate the learned representations of POIs and personalized spatial-temporal information with an extension to the self-attention mechanism for capturing dynamic user preferences. Extensive experiments conducted on two real-world datasets validate the superior performance of our proposed method over state-of-the-art baselines.
Next Point-of-Interest (POI) recommendation task focuses on predicting the immediate next position a user would visit, thus providing appealing location advice. In light of this, graph neural networks (GNNs) based models have recently been emerging as breakthroughs for this task due to their ability to learn global user preferences and alleviate cold-start challenges. Nevertheless, most existing methods merely focus on the relations between POIs, neglecting the higher-order information including user trajectories and the collaborative relations among trajectories. In this paper, we propose the Spatio-Temporal HyperGraph Convolutional Network (STHGCN). This model leverages a hypergraph to capture the trajectory-grain information and learn from user’s historical trajectories (intra-user) as well as collaborative trajectories from other users (inter-user). Furthermore, a novel hypergraph transformer is introduced to effectively combine the hypergraph structure encoding with spatio-temporal information. Extensive experiments on real-world datasets demonstrate that our model outperforms the existing state-of-the-art methods and further analysis confirms the effectiveness in alleviating cold-start issues and achieving improved performance for both short and long trajectories.
With the raised privacy concerns and rigorous data regulations, federated learning has become a hot collaborative learning paradigm for the recommendation model without sharing the highly sensitive POI data. However, the time-sensitive, heterogeneous, and limited POI records seriously restrict the development of federated POI recommendation. To this end, in this paper, we design the fine-grained preference-aware personalized federated POI recommendation framework, namely PrefFedPOI, under extremely sparse historical trajectories to address the above challenges. In details, PrefFedPOI extracts the fine-grained preference of current time slot by combining historical recent preferences and periodic preferences within each local client. Due to the extreme lack of POI data in some time slots, a data amount aware selective strategy is designed for model parameters uploading. Moreover, a performance enhanced clustering mechanism with reinforcement learning is proposed to capture the preference relatedness among all clients to encourage the positive knowledge sharing. Furthermore, a clustering teacher network is designed for improving efficiency by clustering guidance. Extensive experiments are conducted on two diverse real-world datasets to demonstrate the effectiveness of proposed PrefFedPOI comparing with state-of-the-arts. In particular, personalized PrefFedPOI can achieve 7% accuracy improvement on average among data-sparsity clients.
As an indispensable personalized service in Location-based Social Networks (LBSNs), the next Point-of-Interest (POI) recommendation aims to help people discover attractive and interesting places. Currently, most POI recommenders are based on the conventional centralized paradigm that heavily relies on the cloud to train the recommendation models with large volumes of collected users’ sensitive check-in data. Although a few recent works have explored on-device frameworks for resilient and privacy-preserving POI recommendations, they invariably hold the assumption of model homogeneity for parameters/gradients aggregation and collaboration. However, users’ mobile devices in the real world have various hardware configurations (e.g., compute resources), leading to heterogeneous on-device models with different architectures and sizes. In light of this, We propose a novel on-device POI recommendation framework, namely Model-Agnostic Collaborative learning for on-device POI recommendation (MAC), allowing users to customize their own model structures (e.g., dimension & number of hidden layers). To counteract the sparsity of on-device user data, we propose to pre-select neighbors for collaboration based on physical distances, category-level preferences, and social networks. To assimilate knowledge from the above-selected neighbors in an efficient and secure way, we adopt the knowledge distillation framework with mutual information maximization. Instead of sharing sensitive models/gradients, clients in MAC only share their soft decisions on a preloaded reference dataset. To filter out low-quality neighbors, we propose two sampling strategies, performance-triggered sampling and similarity-based sampling, to speed up the training process and obtain optimal recommenders. In addition, we design two novel approaches to generate more effective reference datasets while protecting users’ privacy. Extensive experiments on two datasets have shown the superiority of MAC over advanced baselines.
SESSION: Session 9 – Music, video, Social Media
Recent years have witnessed the rapid development of online micro-video platforms, in which the recommender system plays an essential role in overcoming the information overloading problem and providing personalized content for users. Although some progress has been achieved in the micro-video recommendation, there are still some limitations in learning the representations of user interests and video features. Specifically, the user modeling in existing works is performed at a coarse-grained level, i.e., video level. However, in micro-video recommendation, the user feedback is at a continuous form—users can skip over a video at each frame—which reveals fine-grained user preferences. In this work, we approach the problem of learning fine-grained user preferences for micro-video recommendation by first collecting two real-world datasets. To address the challenges of preference modeling and weak supervision signal, we propose a solution named FRAME (short for Fine-gRAined preference-modeling for Micro-video rEcommendation). Specifically, we first adopt visual feature extraction and transformation to maintain the fine-grained video embeddings. We then propose graph convolution layers to learn the user preference from complex and fine-grained user-clip relations, and hybrid-supervision objectives for enhancing the supervision signal. The experimental results on two collected real-world datasets demonstrate the effectiveness of our proposed model. We release the datasets and codes in https://github.com/tsinghua-fib-lab/FRAME, which we believe can benefit the community.
Multimedia content is of predominance in the modern Web era. Many recommender models have been proposed to investigate how users interact with items which are represented in diverse modalities. In real scenarios, different modalities reveal different aspects of item attributes and usually possess different importance to user purchase decisions. However, it is difficult for models to figure out users’ true preference towards different modalities since there exists strong statistical correlation between different modalities. Even worse, the strong statistical correlation might mislead models to learn the spurious preference towards inconsequential modalities. As a result, when data (modal features) distribution shifts, the learned spurious preference might not guarantee to be as effective on inference as on the training set.
Given that the statistical correlation between different modalities is a major cause of this problem, we propose a novel MOdality DEcorrelating STable learning framework, MODEST for brevity, to learn users’ stable preference. Inspired by sample re-weighting techniques, the proposed method aims to estimate a weight for each item, such that the features from different modalities in the weighted distribution are decorrelated. We adopt Hilbert Schmidt Independence Criterion (HSIC) as independence testing measure which is a kernel-based method capable of evaluating the correlation degree between two multi-dimensional and non-linear variables. Moreover, by utilizing adaptive gradient mask, we empower HSIC with the ability to measure task-relevant correlation. Overall, in the training phase, we alternately optimize (1) model parameters via minimizing weighted Bayesian Personalized Ranking (BPR) loss and (2) sample weights via minimizing modal correlation in the weighted distribution (quantified through HSIC loss). In the inference phase, we can directly use the the trained multimedia recommendation models to make recommendations. Our method could be served as a play-and-plug module for existing multimedia recommendation backbones. Extensive experiments on four public datasets and four state-of-the-art multimedia recommendation backbones unequivocally show that our proposed method can improve the performances by a large margin.
In the field of music information retrieval (MIR), cover song identification (CSI) is a challenging task that aims to identify cover versions of a query song from a massive collection. Existing works still suffer from high intra-song variances and inter-song correlations, due to the entangled nature of version-specific and version-invariant factors in their modeling. In this work, we set the goal of disentangling version-specific and version-invariant factors, which could make it easier for the model to learn invariant music representations for unseen query songs. We analyze the CSI task in a disentanglement view with the causal graph technique, and identify the intra-version and inter-version effects biasing the invariant learning. To block these effects, we propose the disentangled music representation learning framework (DisCover) for CSI. DisCover consists of two critical components: (1) Knowledge-guided Disentanglement Module (KDM) and (2) Gradient-based Adversarial Disentanglement Module (GADM), which block intra-version and inter-version biased effects, respectively. KDM minimizes the mutual information between the learned representations and version-variant factors that are identified with prior domain knowledge. GADM identifies version-variant factors by simulating the representation transitions between intra-song versions, and exploits adversarial distillation for effect blocking. Extensive comparisons with best-performing methods and in-depth analysis demonstrate the effectiveness of DisCover and the and necessity of disentanglement for CSI.
Music streaming services often aim to recommend songs for users to extend the playlists they have created on these services. However, extending playlists while preserving their musical characteristics and matching user preferences remains a challenging task, commonly referred to as Automatic Playlist Continuation (APC). Besides, while these services often need to select the best songs to recommend in real-time and among large catalogs with millions of candidates, recent research on APC mainly focused on models with few scalability guarantees and evaluated on relatively small datasets. In this paper, we introduce a general framework to build scalable yet effective APC models for large-scale applications. Based on a represent-then-aggregate strategy, it ensures scalability by design while remaining flexible enough to incorporate a wide range of representation learning and sequence modeling techniques, e.g., based on Transformers. We demonstrate the relevance of this framework through in-depth experimental validation on Spotify’s Million Playlist Dataset (MPD), the largest public dataset for APC. We also describe how, in 2022, we successfully leveraged this framework to improve APC in production on Deezer. We report results from a large-scale online A/B test on this service, emphasizing the practical impact of our approach in such a real-world application.
Text-to-video(T2V) retrieval aims to find relevant videos from text queries. The recently introduced Contrastive Language Image Pretraining (CLIP), a pretrained language-vision model trained on large-scale image and caption pairs, has been extensively studied in the literature for this task. Existing studies on T2V task have aimed to transfer the CLIP knowledge and focus on enhancing retrieval performance through fine-grained representation learning. While fine-grained contrast has achieved some remarkable results, less attention has been paid to coarse-grained contrasts. To this end, we propose a method called Graph Patch Spreading (GPS) to aggregate patches across frames at the coarse-grained level. We apply GPS to our proposed framework called Multi-Encoder Multi-Expert (MEME) framework. Our proposed scheme is general enough to be applied to any existing CLIP-based video-text retrieval models. We demonstrate the effectiveness of our method on existing models over the benchmark datasets MSR-VTT, MSVD, and LSMDC datasets. Our code can be found at https://github.com/kang7734/MEME__.
Twitter bot detection has become a crucial task in efforts to combat online misinformation, mitigate election interference, and curb malicious propaganda. However, advanced Twitter bots often attempt to mimic the characteristics of genuine users through feature manipulation and disguise themselves to fit in diverse user communities, posing challenges for existing Twitter bot detection models. To this end, we propose BotMoE, a Twitter bot detection framework that jointly utilizes multiple user information modalities (metadata, textual content, network structure) to improve the detection of deceptive bots. Furthermore, BotMoE incorporates a community-aware Mixture-of-Experts (MoE) layer to improve domain generalization and adapt to different Twitter communities. Specifically, BotMoE constructs modal-specific encoders for metadata features, textual content, and graph structure, which jointly model Twitter users from three modal-specific perspectives. We then employ a community-aware MoE layer to automatically assign users to different communities and leverage the corresponding expert networks. Finally, user representations from metadata, text, and graph perspectives are fused with an expert fusion layer, combining all three modalities while measuring the consistency of user information. Extensive experiments demonstrate that BotMoE significantly advances the state-of-the-art on three Twitter bot detection benchmarks. Studies also confirm that BotMoE captures advanced and evasive bots, alleviates the reliance on training data, and better generalizes to new and previously unseen user communities.
SESSION: Session 10 – Sparsity Problem
Modern recommender systems often deal with a variety of user interactions, e.g., click, forward, purchase, etc., which requires the underlying recommender engines to fully understand and leverage multi-behavior data from users. Despite recent efforts towards making use of heterogeneous data, multi-behavior recommendation still faces great challenges. Firstly, sparse target signals and noisy auxiliary interactions remain an issue. Secondly, existing methods utilizing self-supervised learning (SSL) to tackle the data sparsity neglect the serious optimization imbalance between the SSL task and the target task. Hence, we propose a Multi-Behavior Self-Supervised Learning (MBSSL) framework together with an adaptive optimization method. Specifically, we devise a behavior-aware graph neural network incorporating the self-attention mechanism to capture behavior multiplicity and dependencies. To increase the robustness to data sparsity under the target behavior and noisy interactions from auxiliary behaviors, we propose a novel self-supervised learning paradigm to conduct node self-discrimination at both inter-behavior and intra-behavior levels. In addition, we develop a customized optimization strategy through hybrid manipulation on gradients to adaptively balance the self-supervised learning task and the main supervised recommendation task. Extensive experiments on five real-world datasets demonstrate the consistent improvements obtained by MBSSL over ten state-of-the-art (SOTA) baselines. We release our model implementation at: https://github.com/Scofield666/MBSSL.git.
Text classification is a fundamental problem in information retrieval with many real-world applications, such as predicting the topics of online articles and the categories of e-commerce product descriptions. However, low-resource text classification, with few or no labeled samples, poses a serious concern for supervised learning. Meanwhile, many text data are inherently grounded on a network structure, such as a hyperlink/citation network for online articles, and a user-item purchase network for e-commerce products. These graph structures capture rich semantic relationships, which can potentially augment low-resource text classification. In this paper, we propose a novel model called Graph-Grounded Pre-training and Prompting (G2P2) to address low-resource text classification in a two-pronged approach. During pre-training, we propose three graph interaction-based contrastive strategies to jointly pre-train a graph-text model; during downstream classification, we explore prompting for the jointly pre-trained model to achieve low-resource classification. Extensive experiments on four real-world datasets demonstrate the strength of G2P2 in zero- and few-shot low-resource text classification tasks
Recently, Multi-Scenario Learning (MSL) is widely used in recommendation and retrieval systems in the industry because it facilitates transfer learning from different scenarios, mitigating data sparsity and reducing maintenance cost. These efforts produce different MSL paradigms by searching more optimal network structure, such as Auxiliary Network, Expert Network, and Multi-Tower Network. It is intuitive that different scenarios could hold their specific characteristics, activating the user’s intents quite differently. In other words, different kinds of auxiliary features would bear varying importance under different scenarios. With more discriminative feature representations refined in a scenario-aware manner, better ranking performance could be easily obtained without expensive search for the optimal network structure. Unfortunately, this simple idea is mainly overlooked but much desired in real-world systems.
To this end, in this paper, we propose a multi-scenario ranking framework with adaptive feature learning (named MARIA). Specifically, MARIA is devised to inject the scenario semantics in the bottom part of the network to derive more discriminative feature representations. There are three components designed in MARIA for this purpose: feature scaling, feature refinement, and feature correlation modeling. The purpose of feature scaling is to highlight the scenario-relevant fields and also suppress the irrelevant ones. Then, the feature refinement utilizes an automatic refiner selection subnetwork for each feature field, such that the high-level semantics with respect to the scenario can be extracted with the optimal expert. Afterwards, we further explicitly derive the feature correlations across fields as complementary signals. The resultant representations are then fed into a simple MoE structure with an additional scenario-shared tower for final prediction. Experiments on two large-scale real-world datasets demonstrate the superiority of MARIA against several state-of-the-art baselines for both product search and recommendation. Further analysis also validates the rationality of adaptive feature learning under a multi-scenario scheme. Moreover, our A/B test results on the Alibaba search advertising platform also demonstrate that MARIA is superior in production environments.
LOAM: Improving Long-tail Session-based Recommendation via Niche Walk Augmentation and Tail Session Mixup
Session-based recommendation aims to predict the user’s next action based on anonymous sessions without using side information. Most of the real-world session datasets are sparse and have long-tail item distribution. Although long-tail item recommendation plays a crucial role in improving user satisfaction, only a few methods have been proposed to take the long-tail session recommendation into consideration. Previous works in handling data sparsity problems are mostly limited to self-supervised learning techniques with heuristic augmentation which can ruin the original characteristic of session datasets, sequential and co-occurrences, and make noisier short sessions by dropping items and cropping sequences. We propose a novel method, LOAM, improving LOng-tail session-based recommendation via niche walk Augmentation and tail session Mixup, that alleviates popularity bias and enhances long-tail recommendation performance. LOAM consists of two modules, Niche Walk Augmentation (NWA) and Tail Session Mixup (TSM). NWA can generate synthetic sessions considering long-tail distribution which are likely to be found in original datasets, unlike previous heuristic methods, and expose a recommender model to various item transitions with global information. This improves the item coverage of recommendations. TSM makes the model more generalized and robust by interpolating sessions at the representation level. It encourages the recommender system to predict niche items with more diversity and relevance. We conduct extensive experiments with four real-world datasets and verify that our methods greatly improve tail performance while balancing overall performance.
Beyond accuracy, there are a variety of aspects to the quality of recommender systems, such as diversity, fairness, and robustness. We argue that many of the prevalent problems in recommender systems are partly due to low-dimensionality of user and item embeddings, particularly when dot-product models, such as matrix factorization, are used.
In this study, we showcase empirical evidence suggesting the necessity of sufficient dimensionality for user/item embeddings to achieve diverse, fair, and robust recommendation. We then present theoretical analyses of the expressive power of dot-product models. Our theoretical results demonstrate that the number of possible rankings expressible under dot-product models is exponentially bounded by the dimension of item factors. We empirically found that the low-dimensionality contributes to a popularity bias, widening the gap between the rank positions of popular and long-tail items; we also give a theoretical justification for this phenomenon.
Two-tower models are a prevalent matching framework for recommendation, which have been widely deployed in industrial applications. The success of two-tower matching attributes to its efficiency in retrieval among a large number of items, since the item tower can be precomputed and used for fast Approximate Nearest Neighbor (ANN) search. However, it suffers two main challenges, including limited feature interaction capability and reduced accuracy in online serving. Existing approaches attempt to design novel late interactions instead of dot products, but they still fail to support complex feature interactions or lose retrieval efficiency. To address these challenges, we propose a new matching paradigm named SparCode, which supports not only sophisticated feature interactions but also efficient retrieval. Specifically, SparCode introduces an all-to-all interaction module to model fine-grained query-item interactions. Besides, we design a discrete code-based sparse inverted index jointly trained with the model to achieve effective and efficient model inference. Extensive experiments have been conducted on open benchmark datasets to demonstrate the superiority of our framework. The results show that SparCode significantly improves the accuracy of candidate item matching while retaining the same level of retrieval efficiency with two-tower models.
SESSION: Session 11 – Evaluation
A well-known problem when learning from user clicks are inherent biases prevalent in the data, such as position or trust bias. Click models are a common method for extracting information from user clicks, such as document relevance in web search, or to estimate click biases for downstream applications such as counterfactual learning-to-rank, ad placement, or fair ranking. Recent work shows that the current evaluation practices in the community fail to guarantee that a well-performing click model generalizes well to downstream tasks in which the ranking distribution differs from the training distribution, i.e., under covariate shift. In this work, we propose an evaluation metric based on conditional independence testing to detect a lack of robustness to covariate shift in click models. We introduce the concept of debiasedness and a metric for measuring it. We prove that debiasedness is a necessary condition for recovering unbiased and consistent relevance scores and for the invariance of click prediction under covariate shift. In extensive semi-synthetic experiments, we show that our proposed metric helps to predict the downstream performance of click models under covariate shift and is useful in an off-policy model selection setting.
Numerous pre-training techniques for visual document understanding (VDU) have recently shown substantial improvements in performance across a wide range of document tasks. However, these pre-trained VDU models cannot guarantee continued success when the distribution of test data differs from the distribution of training data. In this paper, to investigate how robust existing pre-trained VDU models are to various distribution shifts, we first develop an out-of-distribution (OOD) benchmark termed Do-GOOD for the fine-Grained analysis on Document image-related tasks specifically. The Do-GOOD benchmark defines the underlying mechanisms that result in different distribution shifts and contains 9 OOD datasets covering 3 VDU related tasks, e.g., document information extraction, classification and question answering. We then evaluate the robustness and perform a fine-grained analysis of 5 latest VDU pre-trained models and 2 typical OOD generalization algorithms on these OOD datasets. Results from the experiments demonstrate that there is a significant performance gap between the in-distribution (ID) and OOD settings for document images, and that fine-grained analysis of distribution shifts can reveal the brittle nature of existing pre-trained VDU models and OOD generalization algorithms. The code and datasets for our Do-GOOD benchmark can be found at https://github.com/MAEHCM/Do-GOOD.
Effective queries are crucial to minimising the time and cost of medical systematic reviews, as all retrieved documents must be judged for relevance. Boolean queries, developed by expert librarians, are the standard for systematic reviews. They guarantee reproducible and verifiable retrieval and more control than free-text queries. However, the result sets of Boolean queries are unranked and difficult to control due to the strict Boolean operators. We address these problems in a single unified retrieval model by formulating a class of smooth operators that are compatible with and extend existing Boolean operators. Our smooth operators overcome several shortcomings of previous extensions of the Boolean retrieval model. In particular, our operators are independent of the underlying ranking function, so that exact-match and large language model rankers can be combined in the same query. We found that replacing Boolean operators with equivalent or similar smooth operators often improves the effectiveness of queries. Their properties make tuning a query to precision or recall intuitive and allow greater control over how documents are retrieved. This additional control leads to more effective queries and reduces the cost of systematic reviews.
SESSION: Session 12 – Knowledge Graphs – Part 1
To address the scarcity of massive labeled data, cross-domain named entity recognition (cross-domain NER) attracts increasing attention. Recent studies focus on decomposing NER into two separate tasks (i.e., entity span detection and entity type classification) to reduce the complexity of the cross-domain transfer. Despite the promising results, there still exists room for improvement. In particular, the rich domain-shared syntactic and semantic information, which are respectively important for entity span detection and entity type classification, are still underutilized. In light of these two challenges, we propose applying graph attention networks (GATs) to encode the above two kinds of information. Moreover, considering that GATs mainly operate in the Euclidean space, which may fail to capture the latent hierarchical relations among words for learning high-quality word representations, we further propose to embed words into Hyperbolic spaces. Finally, a decouple hyperbolic graph attention network (DH-GAT) is introduced for cross-domain NER. Empirical results on 10 domain pairs show that DH-GAT achieves state-of-the-art performance on several standard metrics, and further analyses are presented to better understand each component’s effectiveness.
Many real-world graph learning tasks require handling dynamic graphs where new nodes and edges emerge. Dynamic graph learning methods commonly suffer from the catastrophic forgetting problem, where knowledge learned for previous graphs is overwritten by updates for new graphs. To alleviate the problem, continual graph learning methods are proposed. However, existing continual graph learning methods aim to learn new patterns and maintain old ones with the same set of parameters of fixed size, and thus face a fundamental tradeoff between both goals. In this paper, we propose Parameter Isolation GNN (PI-GNN) for continual learning on dynamic graphs that circumvents the tradeoff via parameter isolation and expansion. Our motivation lies in that different parameters contribute to learning different graph patterns. Based on the idea, we expand model parameters to continually learn emerging graph patterns. Meanwhile, to effectively preserve knowledge for unaffected patterns, we find parameters that correspond to them via optimization and freeze them to prevent them from being rewritten. Experiments on eight real-world datasets corroborate the effectiveness of PI-GNN compared to state-of-the-art baselines.
In this paper, we propose neural-symbolic graph databases (NSGDs) that extends traditional graph data with content and structural embeddings in every node. The content embeddings can represent unstructured data (e.g., images, videos, and texts), while structural embeddings can be used to deal with incomplete graphs. We can advocate machine learning models (e.g., deep learning) to transform unstructured data and graph nodes to these embeddings. NSGDs can support a wide range of applications (e.g., online recommendation and natural language question answering) in social-media networks, multi-modal knowledge graphs and etc. As a typical search over graphs, we study subgraph search over a large NSGD, called neural-symbolic subgraph matching (NSMatch) that includes a novel ranking search function. Specifically, we develop a general algorithmic framework to process NSMatch efficiently. Using real-life multi-modal graphs, we experimentally verify the effectiveness, scalability and efficiency of NSMatch.
Learning representations for temporal knowledge graphs (TKGs) is a fundamental task. Most existing methods regard TKG as a sequence of static snapshots and recurrently learn representations by retracing the previous snapshots. However, new knowledge can be continuously accrued to TKGs as streams. These methods either cannot handle new entities or fail to update representations in real time, making them unfeasible to adapt to the streaming scenarios. In this paper, we propose a lightweight framework called StreamE towards the efficient generation of TKG representations in streaming scenarios. To reduce the parameter size, entity representations in StreamE are decoupled from the model training to serve as the memory module to store the historical information of entities. To achieve efficient update and generation, the process of generating representations is decoupled as two functions in StreamE. An update function is learned to incrementally update entity representations based on the newly-arrived knowledge and a read function is learned to predict the future semantics of entity representations. The update function avoids the recurrent modeling paradigm and thus gains high efficiency while the read function considers multiple semantic change properties. We further propose a joint training strategy with two temporal regularizations to effectively optimize the framework. Experimental results show that StreamE can achieve better performance than baseline methods with 100x faster in inference, 25x faster in training, and only 1/5 parameter size, which demonstrates its superiority. Code is available at https://github.com/zjs123/StreamE.
SESSION: Session 13 – Conversational Search and Recommendation
This research aims to explore various methods for assessing user feedback in mixed-initiative conversational search (CS ) systems. While CS systems enjoy profuse advancements across multiple aspects, recent research fails to successfully incorporate feedback from the users. One of the main reasons for that is the lack of system-user conversational interaction data. To this end, we propose a user simulator-based framework for multi-turn interactions with a variety of mixed-initiative CS systems. Specifically, we develop a user simulator, dubbed ConvSim, that, once initialized with an information need description, is capable of providing feedback to system’s responses, as well as answering potential clarifying questions. Our experiments on a wide variety of state-of-the-art passage retrieval and neural re-ranking models show that effective utilization of user feedback can lead to 16% retrieval performance increase in terms of nDCG@3. Moreover, we observe consistent improvements as the number of feedback rounds increases (35% relative improvement in terms of nDCG@3 after three rounds). This points to a research gap in the development of specific feedback processing modules and opens a potential for significant advancements in CS. To support further research in the topic, we release over 30 000 transcripts of system-simulator interactions based on well-established CS datasets.
Explainable Conversational Question Answering over Heterogeneous Sources via Iterative Graph Neural Networks
In conversational question answering, users express their information needs through a series of utterances with incomplete context. Typical ConvQA methods rely on a single source (a knowledge base (KB), or a text corpus, or a set of tables), thus being unable to benefit from increased answer coverage and redundancy of multiple sources. Our method EXPLAIGNN overcomes these limitations by integrating information from a mixture of sources with user-comprehensible explanations for answers. It constructs a heterogeneous graph from entities and evidence snippets retrieved from a KB, a text corpus, web tables, and infoboxes. This large graph is then iteratively reduced via graph neural networks that incorporate question-level attention, until the best answers and their explanations are distilled. Experiments show that EXPLAIGNN improves performance over state-of-the-art baselines. A user study demonstrates that derived answers are understandable by end users.
Conversational recommendation systems (CRS) aim to interactively acquire user preferences and accordingly recommend items to users. Accurately learning the dynamic user preferences is of crucial importance for CRS. Previous works learn the user preferences with pairwise relations from the interactive conversation and item knowledge, while largely ignoring the fact that factors for a relationship in CRS are multiplex. Specifically, the user likes/dislikes the items that satisfy some attributes (Like/Dislike view). Moreover social influence is another important factor that affects user preference towards the item (Social view), while is largely ignored by previous works in CRS. The user preferences from these three views are inherently different but also correlated as a whole. The user preferences from the same views should be more similar than that from different views. The user preferences from Like View should be similar to Social View while different from Dislike View. To this end, we propose a novel model, namely Multi-view Hypergraph Contrastive Policy Learning (MHCPL). Specifically, MHCPL timely chooses useful social information according to the interactive history and builds a dynamic hypergraph with three types of multiplex relations from different views. The multiplex relations in each view are successively connected according to their generation order in the interactive conversation. A hierarchical hypergraph neural network is proposed to learn user preferences by integrating information of the graphical and sequential structure from the dynamic hypergraph. A cross-view contrastive learning module is proposed to maintain the inherent characteristics and the correlations of user preferences from different views. Extensive experiments conducted on benchmark datasets demonstrate that MHCPL outperforms the state-of-the-art methods.
SESSION: Session 14 – Efficiency
An Effective, Efficient, and Scalable Confidence-based Instance Selection Framework for Transformer-Based Text Classification
Transformer-based deep learning is currently the state-of-the-art in many NLP and IR tasks. However, fine-tuning such Transformers for specific tasks, especially in scenarios of ever-expanding volumes of data with constant re-training requirements and budget constraints, is costly (computationally and financially) and energy-consuming. In this paper, we focus on Instance Selection (IS) – a set of methods focused on selecting the most representative documents for training, aimed at maintaining (or improving) classification effectiveness while reducing total time for training (or fine-tuning). We propose E2SC-IS — Effective, Efficient, and Scalable Confidence-Based IS — a two-step framework with a particular focus on Transformers and large datasets. E2SC-IS estimates the probability of each instance being removed from the training set based on scalable, fast, and calibrated weak classifiers. E2SC-IS also exploits iterative heuristics to estimate a near-optimal reduction rate. Our solution can reduce the training sets by 29% on average while maintaining the effectiveness in all datasets, with speedup gains up to 70%, scaling for very large datasets (something that the baselines cannot do).
In an edge storage system, popular data can be stored on edge servers to enable low-latency data retrieval for nearby users. Suffering from constrained storage capacities, edge servers must process users’ data requests collaboratively. For sourcing data, it is essential to find out which edge servers in the system have the requested data. In this paper, we make the first attempt to study this edge data query (EDQ) problem and present EDIndex, a distributed Edge Data Indexing system to enable fast data queries at the edge. First, we introduce a new index structure named Counting Bloom Filter (CBF) tree for facilitating edge data queries. Then, to improve query performance, we enhance EDIndex with a novel index structure named hierarchical Counting Bloom Filter (HCBF) tree. In EDIndex, each edge server maintains an HCBF tree that indexes the data stored on nearby edge servers to facilitate data sourcing between edge servers at the edge. The results of extensive experiments conducted on an edge storage system comprised of 90 edge servers demonstrate that EDIndex 1) takes up to 8.8x less time to answer edge data queries compared with state-of-the-art edge indexing systems; and 2) can be implemented in practice with a high query accuracy at low initialization and maintenance overheads.
Recently, numerous proxy hash code based methods, which sufficiently exploit the label information of data to supervise the training of hashing models, have been proposed. Although these methods have made impressive progress, their generating processes of proxy hash codes are based only on the class information of the dataset or labels of data but do not take the data themselves into account. Therefore, these methods will probably generate some inappropriate proxy hash codes, thus damaging the retrieval performance of the hash models. To solve the aforementioned problem, we propose a novel Data-Aware Proxy Hashing for cross-modal retrieval, called DAPH. Specifically, our proposed method first train a data-aware proxy network that takes the data points, label vectors of data, and the class vectors of the dataset as inputs to generate class-based data-aware proxy hash codes, label-fused image-aware proxy hash codes and label-fused text-aware proxy hash codes. Then, we propose a novel hash loss that exploits the three types of data-aware proxy hash codes to supervise the training of modality-specific hashing networks. After training, DAPH is able to generate discriminate hash codes with the semantic information preserved adequately. Extensive experiments on three benchmark datasets show that the proposed DAPH outperforms the state-of-the-art baselines in cross-modal retrieval tasks.
Fast item ranking is an important task in recommender systems. In previous works, graph-based Approximate Nearest Neighbor (ANN) approaches have demonstrated good performance on item ranking tasks with generic searching/matching measures (including complex measures such as neural network measures). However, since these ANN approaches must go through the neural measures several times during ranking, the computation is not practical if the neural measure is a large network. On the other hand, fast item ranking using existing hashing-based approaches, such as Locality Sensitive Hashing (LSH), only works with a limited set of measures, such as cosine and Euclidean distance, but not with general search measures such as neural networks. Given an arbitrary searching measure, previous learning-to-hash approaches are also not suitable to solve the fast item ranking problem since they can take a significant amount of time and computation to train the hash functions to approximate the searching measure due to a large number of possible training pairs in this problem. Hashing approaches, however, are attractive because they provide a principal and efficient way to retrieve candidate items. In this paper, we propose a simple and effective learning-to-hash approach for the fast item ranking problem that can be used to efficiently approximate any type of measure, including neural network measures. Specifically, we solve this problem with an asymmetric hashing framework based on discrete inner product fitting. We learn a pair of related hash functions that map heterogeneous objects (e.g., users and items) into a common discrete space where the inner product of their binary codes reveals their true similarity defined via the original searching measure. The fast ranking problem is reduced to an ANN search via this asymmetric hashing scheme. Then, we propose a sampling strategy to efficiently select relevant and contrastive samples to train the hashing model. We empirically validate the proposed method against the existing state-of-the-art fast item ranking methods in several combinations of non-linear searching functions and prominent datasets.
Latent factor models are the most popular backbones for today’s recommender systems owing to their prominent performance. Latent factor models represent users and items as real-valued embedding vectors for pairwise similarity computation, and all embeddings are traditionally restricted to a uniform size that is relatively large (e.g., 256-dimensional). With the exponentially expanding user base and item catalog in contemporary e commerce, this design is admittedly becoming memory-inefficient. To facilitate lightweight recommendation, reinforcement learning (RL) has recently opened up opportunities for identifying varying embedding sizes for different users/items. However, challenged by search efficiency and learning an optimal RL policy, existing RL-based methods are restricted to highly discrete, predefined embedding size choices. This leads to a largely overlooked potential of introducing finer granularity into embedding sizes to obtain better recommendation effectiveness under a given memory budget. In this paper, we propose continuous input embedding size search (CIESS), a novel RL-based method that operates on a continuous search space with arbitrary embedding sizes to choose from. In CIESS, we further present an innovative random walk-based exploration strategy to allow the RL policy to efficiently explore more candidate embedding sizes and converge to a better decision. CIESS is also model-agnostic and hence generalizable to a variety of latent factor RSs, whilst experiments on two real-world datasets have shown state-of-the-art performance of CIESS under different memory budgets when paired with three popular recommendation models.
SESSION: Session 15 – Crowdsourced and Annotation
The creation of relevance assessments by human assessors (often nowadays crowdworkers) is a vital step when building IR test collections. Prior works have investigated assessor quality & behaviour, and tooling to support assessors in their task. We have few insights though into the impact of a document’s presentation modality on assessor efficiency and effectiveness. Given the rise of voice-based interfaces, we investigate whether it is feasible for assessors to judge the relevance of text documents via a voice-based interface. We ran a user study (n = 49) on a crowdsourcing platform where participants judged the relevance of short and long documents- sampled from the TREC Deep Learning corpus-presented to them either in the text or voice modality. We found that: (i) participants are equally accurate in their judgements across both the text and voice modality; (ii) with increased document length it takes partic- ipants significantly longer (for documents of length > 120 words it takes almost twice as much time) to make relevance judgements in the voice condition; and (iii) the ability of assessors to ignore stimuli that are not relevant (i.e., inhibition) impacts the assessment quality in the voice modality-assessors with higher inhibition are significantly more accurate than those with lower inhibition. Our results indicate that we can reliably leverage the voice modality as a means to effectively collect relevance labels from crowdworkers.
Label aggregation (LA) is the task of inferring a high-quality label for an example from multiple noisy labels generated by either human annotators or model predictions. Existing work on LA assumes a label generation process and designs a probabilistic graphical model (PGM) to learn latent true labels from observed crowd labels. However, the performance of PGM-based LA models is easily affected by the noise of the crowd labels. As a consequence, the performance of LA models differs on different datasets and no single LA model outperforms the rest on all datasets.
We extend PGM-based LA models by integrating a GP prior on the true labels. The advantage of LA models extended with a GP prior is that they can take as input crowd labels, example features, and existing pre-trained label prediction models to infer the true labels, while the original LA can only leverage crowd labels. Experimental results on both synthetic and real datasets show that any LA models extended with a GP prior and a suitable mean function achieves better performance than the underlying LA models, demonstrating the effectiveness of using a GP prior.
Serendipity is a notion that means an unexpected but valuable discovery. Due to its elusive and subjective nature, serendipity is difficult to study even with today’s advances in machine learning and deep learning techniques. Both ground truth data collecting and model developing are the open research questions. This paper addresses both the data and the model challenges for identifying serendipity in recommender systems. For the ground truth data collecting, it proposes a new and scalable approach by using both user generated reviews and a crowd sourcing method. The result is a large-scale ground truth data on serendipity. For model developing, it designed a self-enhanced module to learn the fine-grained facets of serendipity in order to mitigate the inherent data sparsity problem in any serendipity ground truth dataset. The self-enhanced module is general enough to be applied with many base deep learning models for serendipity. A series of experiments have been conducted. As the result, a base deep learning model trained on our collected ground truth data, as well as with the help of the self-enhanced module, outperforms the state-of-the-art baseline models in predicting serendipity.
Dataset Preparation for Arbitrary Object Detection: An Automatic Approach based on Web Information in English
Automatic dataset preparation can help users avoid labor-intensive and costly manual data annotations. The difficulty in preparing a high-quality dataset for object detection involves three key aspects: relevance, naturality, and balance, which are not addressed by existing works. In this paper, we leverage information from the web, and propose a fully-automatic dataset preparation mechanism without any human annotation, which can automatically prepare a high-quality training dataset for the detection task with English text terms describing target objects. It contains three key designs, i.e., keyword expansion, data de-noising, and data balancing. Our experiments demonstrate that the object detectors trained with auto-prepared data are comparable to those trained with benchmark datasets and outperform other baselines. We also demonstrate the effectiveness of our approach in several more challenging real-world object categories that are not included in the benchmark datasets.
InceptionXML: A Lightweight Framework with Synchronized Negative Sampling for Short Text Extreme Classification
Automatic annotation of short-text data to a large number of target labels, referred to as Short Text Extreme Classification, has found numerous applications including prediction of related searches and product recommendation. In this paper, we propose a convolutional architecture InceptionXML which is light-weight, yet powerful, and robust to the inherent lack of word-order in short-text queries encountered in search and recommendation. We demonstrate the efficacy of applying convolutions by recasting the operation along the embedding dimension instead of the word dimension as applied in conventional CNNs for text classification. Towards scaling our model to datasets with millions of labels, we also propose SyncXML pipeline which improves upon the shortcomings of the recently proposed dynamic hard-negative mining technique for label shortlisting by synchronizing the label-shortlister and extreme classifier. SyncXML not only reduces the inference time to half but is also an order of magnitude smaller than state-of-the-art Astec in terms of model size. Through a comprehensive empirical comparison, we show that not only can InceptionXML outperform existing approaches on benchmark datasets but also the transformer baselines requiring only 2% FLOPs. The code for InceptionXML is available at https://github.com/xmc-aalto.
SESSION: Session 16 – Question Answering
There has been a growing interest in cross-source searching to gain rich knowledge in recent years. A data lake collects massive raw and heterogeneous data with different data schemas and query interfaces. Many real-life applications require query answering over the heterogeneous data lake, such as e-commerce, bioinformatics and healthcare. In this paper, we propose LakeAns that semantically integrates heterogeneous data schemas of the lake to enhance the semantics of query answers. To this end, we propose a novel framework to efficiently and effectively perform the cross-source searching. The framework exploits a reinforcement learning method to semantically integrate the data schemas and further create a global relational schema for the heterogeneous data. It then performs a query answering algorithm based on the global schema to find answers across multiple data sources. We conduct extensive experimental evaluations using real-life data to verify that our approach outperforms existing solutions in terms of effectiveness and efficiency.
BeamQA: Multi-hop Knowledge Graph Question Answering with Sequence-to-Sequence Prediction and Beam Search
Knowledge Graph Question Answering (KGQA) is a task that aims to answer natural language queries by extracting facts from a knowledge graph. Current state-of-the-art techniques for KGQA rely on text-based information from graph entity and relations labels, as well as external textual corpora. By reasoning over multiple edges in the graph, these can accurately rank and return the most relevant entities. However, one of the limitations of these methods is that they cannot handle the inherent incompleteness of real-world knowledge graphs and may lead to inaccurate answers due to missing edges. To address this issue, recent advances in graph representation learning have led to the development of systems that can use link prediction techniques to handle missing edges probabilistically, allowing the system to reason with incomplete information. However, existing KGQA frameworks that use such techniques often depend on learning a transformation from the query representation to the graph embedding space, which requires access to a large training dataset. We present BeamQA, an approach that overcomes these limitations by combining a sequence-to-sequence prediction model with beam search execution in the embedding space. Our model uses a pre-trained large language model and synthetic question generation. Our experiments demonstrate the effectiveness of BeamQA when compared to other KGQA methods on two knowledge graph question-answering datasets.
Machine reading comprehension requires systems to understand the given passage and answer questions. Previous methods mainly focus on the interaction between the question and passage. However, they ignore the deep exploration of cognitive elements behind questions, such as fine-grained reading skills (this paper focuses on narrative comprehension skills) and implicitness or explicitness of the question (whether the answer can be found in the passage). Grounded in prior literature on reading comprehension, the understanding of a question is a complex process where human beings need to understand the semantics of the question, use different reading skills for different questions, and then judge the implicitness of the question. To this end, a simple but effective Leader-Generator Network is proposed to explicitly separate and extract fine-grained reading skills and the implicitness or explicitness of the question. Specifically, the proposed skill leader accurately captures the semantic representation of fine-grained reading skills with contrastive learning. And the implicitness-aware pointer-generator adaptively extracts or generates the answer based on the implicitness or explicitness of the question. Furthermore, to validate the generalizability of the methodology, we annotate a new dataset named NarrativeQA 1.1. Experiments on the FairytaleQA and NarrativeQA 1.1 show that the proposed model achieves the state-of-the-art performance (about 5% gain on Rouge-L) on the question answering task. Our annotated data and code are available at https://github.com/pengwei-iie/Leader-Generator-Net.
SESSION: Session 17 – Temporal & Dynamic – Part 1
Unsupervised discovery of stories with correlated news articles in real-time helps people digest massive news streams without expensive human annotations. A common approach of the existing studies for unsupervised online story discovery is to represent news articles with symbolic- or graph-based embedding and incrementally cluster them into stories. Recent large language models are expected to improve the embedding further, but a straightforward adoption of the models by indiscriminately encoding all information in articles is ineffective to deal with text-rich and evolving news streams. In this work, we propose a novel thematic embedding with an off-the-shelf pretrained sentence encoder to dynamically represent articles and stories by considering their shared temporal themes. To realize the idea for unsupervised online story discovery, a scalable framework USTORY is introduced with two main techniques, theme- and time-aware dynamic embedding and novelty-aware adaptive clustering, fueled by lightweight story summaries. A thorough evaluation with real news data sets demonstrates that USTORY achieves higher story discovery performances than baselines while being robust and scalable to various streaming settings.
Time is an important aspect of documents and is used in a range of NLP and IR tasks. In this work, we investigate methods for incorporating temporal information during pre-training to further improve the performance on time-related tasks. Compared with common pre-trained language models like BERT which utilize synchronic document collections (e.g., BookCorpus and Wikipedia) as the training corpora, we use long-span temporal news article collection for building word representations. We introduce BiTimeBERT, a novel language representation model trained on a temporal collection of news articles via two new pre-training tasks, which harnesses two distinct temporal signals to construct time-aware language representations. The experimental results show that BiTimeBERT consistently outperforms BERT and other existing pre-trained models with substantial gains on different downstream NLP tasks and applications for which time is of importance (e.g., the accuracy improvement over BERT is 155% on the event time estimation task).
Dynamic share recommendation, which aims at recommending a friend who would like to share a particular item at a certain timestamp, has emerged as a novel task for social-oriented e-commerce platforms. Different from traditional graph-based recommendation tasks, with integrating the interconnected social interactions and fine-grained temporal information from historical share records, this novel task may encounter one unique challenge, i.e., how to deal with the dynamic social connections and asymmetric share interactions. Even worse, users may keep inactive during some periods, which results in difficulties in updating personalized profiles. To address the above challenges, in this paper, we propose a dynamic graph share recommendation model called DynShare. Specifically, we first divide each user embedding into two parts, namely the invitation embedding and vote embedding to show the tendencies of sending and receiving items, respectively. Then, temporal graph attention networks (TGATs) based on bi-directional continuous time dynamic graphs (CTDGs) are leveraged to encode temporal neighbor information from different directions. Afterward, to estimate how different users perceive the time intervals after the last interaction, we further design a time-interval aware personalized projection operator on the foundation of temporal point processes (TPPs) to project user embedding for the next-time share prediction. Extensive experiments on a real-world e-commerce share dataset have demonstrated that our proposed DynShare can achieve better results compared with state-of-the-art baseline methods. And our code is available on the project website: https://github.com/meteor-gif/DynShare.
Generative models such as Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs) are widely utilized to model the generative process of user interactions. However, they suffer from intrinsic limitations such as the instability of GANs and the restricted representation ability of VAEs. Such limitations hinder the accurate modeling of the complex user interaction generation procedure, such as noisy interactions caused by various interference factors. In light of the impressive advantages of Diffusion Models (DMs) over traditional generative models in image synthesis, we propose a novel Diffusion Recommender Model (named DiffRec) to learn the generative process in a denoising manner. To retain personalized information in user interactions, DiffRec reduces the added noises and avoids corrupting users’ interactions into pure noises like in image synthesis. In addition, we extend traditional DMs to tackle the unique challenges in recommendation: high resource costs for large-scale item prediction and temporal shifts of user preference. To this end, we propose two extensions of DiffRec: L-DiffRec clusters items for dimension compression and conducts the diffusion processes in the latent space; and T-DiffRec reweights user interactions based on the interaction timestamps to encode temporal information. We conduct extensive experiments on three datasets under multiple settings (e.g., clean training, noisy training, and temporal training). The empirical results validate the superiority of DiffRec with two extensions over competitive baselines.
Understand the Dynamic World: An End-to-End Knowledge Informed Framework for Open Domain Entity State Tracking
Open domain entity state tracking aims to predict reasonable state changes of entities (i.e., [attribute] of [entity] was [before_state] and [after_state] afterwards) given the action descriptions. It’s important to many reasoning tasks to support human everyday activities. However, it’s challenging as the model needs to predict an arbitrary number of entity state changes caused by the action while most of the entities are implicitly relevant to the actions and their attributes as well as states are from open vocabularies. To tackle these challenges, we propose a novel end-to-end Knowledge Informed framework for open domain Entity State Tracking, namely KIEST, which explicitly retrieves the relevant entities and attributes from external knowledge graph (i.e., ConceptNet) and incorporates them to autoregressively generate all the entity state changes with a novel dynamic knowledge grained encoder-decoder framework. To enforce the logical coherence among the predicted entities, attributes, and states, we design a new constraint decoding strategy and employ a coherence reward to improve the decoding process. Experimental results show that our proposed KIEST framework significantly outperforms the strong baselines on the public benchmark dataset – OpenPI
SESSION: Session 18 – Knowledge Graphs – Part 2
Knowledge graph embedding (KGE) aims to project both entities and relations in a knowledge graph (KG) into low-dimensional vectors. Indeed, existing KGs suffer from the data imbalance issue, i.e., entities and relations conform to a long-tail distribution, only a small portion of entities and relations occur frequently, while the vast majority of entities and relations only have a few training samples. Existing KGE methods assign equal weights to each entity and relation during the training process. Under this setting, long-tail entities and relations are not fully trained during training, leading to unreliable representations. In this paper, we propose WeightE, which attends differentially to different entities and relations. Specifically, WeightE is able to endow lower weights to frequent entities and relations, and higher weights to infrequent ones. In such manner, WeightE is capable of increasing the weights of long-tail entities and relations, and learning better representations for them. In particular, WeightE tailors bilevel optimization for the KGE task, where the inner level aims to learn reliable entity and relation embeddings, and the outer level attempts to assign appropriate weights for each entity and relation. Moreover, it is worth noting that our technique of applying weights to different entities and relations is general and flexible, which can be applied to a number of existing KGE models. Finally, we extensively validate the superiority of WeightE against various state-of-the-art baselines.
Relation-Aware Multi-Positive Contrastive Knowledge Graph Completion with Embedding Dimension Scaling
Recently, a large amount of work has emerged for knowledge graph completion (KGC), which aims to reason over known facts and to infer the missing links. Meanwhile, contrastive learning has been applied to the KGC tasks, which can improve the representation quality of entities and relations. However, existing KGC approaches tend to improve their performance with high-dimensional embeddings and complex models, which make them suffer from large storage space and high training costs. Furthermore, contrastive loss with single positive sample learns little structural and semantic information in knowledge graphs due to the complex relation types. To address these challenges, we propose a novel knowledge graph completion model named ConKGC with the embedding dimension scaling and a relation-aware multi-positive contrastive loss. In order to achieve both space consumption reduction and model performance improvement, a new scoring function is proposed to map the raw low-dimensional embeddings of entities and relations to high-dimensional embedding space, and predict low-dimensional tail entities with latent semantic information of high-dimensional embeddings. In addition, ConKGC designs a multiple weak positive samples based contrastive loss under different relation types to maintain two important training targets, Alignment and Uniformity. This loss function and few parameters of the model ensure that ConKGC performs best and has fast convergence speed. Extensive experiments on three standard datasets confirm the effectiveness of our innovations, and the performance of ConKGC is significantly improved compared to the state-of-the-art methods.
Incorporating Structured Sentences with Time-enhanced BERT for Fully-inductive Temporal Relation Prediction
Temporal relation prediction in incomplete temporal knowledge graphs (TKGs) is a popular temporal knowledge graph completion (TKGC) problem in both transductive and inductive settings. Traditional embedding-based TKGC models (TKGE) rely on structured connections and can only handle a fixed set of entities, i.e., the transductive setting. In the inductive setting where test TKGs contain emerging entities, the latest methods are based on symbolic rules or pre-trained language models (PLMs). However, they suffer from being inflexible and not time-specific, respectively. In this work, we extend the fully-inductive setting, where entities in the training and test sets are totally disjoint, into TKGs and take a further step towards a more flexible and time-sensitive temporal relation prediction approach SST-BERT,incorporating Structured Sentences with Time-enhanced BERT. Our model can obtain the entity history and implicitly learn rules in the semantic space by encoding structured sentences, solving the problem of inflexibility. We propose to use a time masking MLM task to pre-train BERT in a corpus rich in temporal tokens specially generated for TKGs, enhancing the time sensitivity of SST-BERT. To compute the probability of occurrence of a target quadruple, we aggregate all its structured sentences from both temporal and semantic perspectives into a score. Experiments on the transductive datasets and newly generated fully-inductive benchmarks show that SST-BERT successfully improves over state-of-the-art baselines.
Knowledge graphs (KGs), as a structured form of knowledge representation, have been widely applied in the real world. Recently, few-shot knowledge graph completion (FKGC), which aims to predict missing facts for unseen relations with few-shot associated facts, has attracted increasing attention from practitioners and researchers. However, existing FKGC methods are based on metric learning or meta-learning, which often suffer from the out-of-distribution and overfitting problems. Meanwhile, they are incompetent at estimating uncertainties in predictions, which is critically important as model predictions could be very unreliable in few-shot settings. Furthermore, most of them cannot handle complex relations and ignore path information in KGs, which largely limits their performance. In this paper, we propose a normalizing flow-based neural process for few-shot knowledge graph completion (NP-FKGC). Specifically, we unify normalizing flows and neural processes to model a complex distribution of KG completion functions. This offers a novel way to predict facts for few-shot relations while estimating the uncertainty. Then, we propose a stochastic ManifoldE decoder to incorporate the neural process and handle complex relations in few-shot settings. To further improve performance, we introduce an attentive relation path-based graph neural network to capture path information in KGs. Extensive experiments on three public datasets demonstrate that our method significantly outperforms the existing FKGC methods and achieves state-of-the-art performance. Code is available at https://github.com/RManLuo/NP-FKGC.git.
With the development of pre-trained language models, many prompt-based approaches to data-efficient knowledge graph construction have been proposed and achieved impressive performance. However, existing prompt-based learning methods for knowledge graph construction are still susceptible to several potential limitations: (i) semantic gap between natural language and output structured knowledge with pre-defined schema, which means model cannot fully exploit semantic knowledge with the constrained templates; (ii) representation learning with locally individual instances limits the performance given the insufficient features, which are unable to unleash the potential analogical capability of pre-trained language models. Motivated by these observations, we propose a retrieval-augmented approach, which retrieves schema-aware Reference As Prompt (RAP), for data-efficient knowledge graph construction. It can dynamically leverage schema and knowledge inherited from human-annotated and weak-supervised data as a prompt for each sample, which is model-agnostic and can be plugged into widespread existing approaches. Experimental results demonstrate that previous methods integrated with RAP can achieve impressive performance gains in low-resource settings on five datasets of relational triple extraction and event extraction for knowledge graph construction Code is available in https://github.com/zjunlp/RAP.
SESSION: Session 19 – Implicit Feedback and Sessions
Learning reinforcement learning (RL)-based recommenders from historical user-item interaction sequences is vital to generate high-reward recommendations and improve long-term cumulative benefits. However, existing RL recommendation methods encounter difficulties (i) to estimate the value functions for states which are not contained in the offline training data, and (ii) to learn effective state representations from user implicit feedback due to the lack of contrastive signals.
In this work, we propose contrastive state augmentations (CSA) for the training of RL-based recommender systems. To tackle the first issue, we propose four state augmentation strategies to enlarge the state space of the offline data. The proposed method improves the generalization capability of the recommender by making the RL agent visit the local state regions and ensuring the learned value functions are similar between the original and augmented states. For the second issue, we propose introducing contrastive signals between augmented states and the state randomly sampled from other sessions to improve the state representation learning further.
To verify the effectiveness of the proposed CSA, we conduct extensive experiments on two publicly accessible datasets and one dataset collected from a real-life e-commerce platform. We also conduct experiments on a simulated environment as the online evaluation setting. Experimental results demonstrate that CSA can effectively improve recommendation performance.
Recommender systems that learn from implicit feedback often use large volumes of a single type of implicit user feedback, such as clicks, to enhance the prediction of sparse target behavior such as purchases. Using multiple types of implicit user feedback for such target behavior prediction purposes is still an open question. Existing studies that attempted to learn from multiple types of user behavior often fail to: (i) learn universal and accurate user preferences from different behavioral data distributions, and (ii) overcome the noise and bias in observed implicit user feedback.
To address the above problems, we propose multi-behavior alignment (MBA), a novel recommendation framework that learns from implicit feedback by using multiple types of behavioral data. We conjecture that multiple types of behavior from the same user (e.g., clicks and purchases) should reflect similar preferences of that user. To this end, we regard the underlying universal user preferences as a latent variable. The variable is inferred by maximizing the likelihood of multiple observed behavioral data distributions and, at the same time, minimizing the Kullback?Leibler divergence (KL-divergence) between user models learned from auxiliary behavior (such as clicks or views) and the target behavior separately. MBA infers universal user preferences from multi-behavior data and performs data denoising to enable effective knowledge transfer. We conduct experiments on three datasets, including a dataset collected from an operational e-commerce platform. Empirical results demonstrate the effectiveness of our proposed method in utilizing multiple types of behavioral data to enhance the prediction of the target behavior.
In recent years, neural models have been repeatedly touted to exhibit state-of-the-art performance in recommendation. Nevertheless, multiple recent studies have revealed that the reported state-of-the-art results of many neural recommendation models cannot be reliably replicated. A primary reason is that existing evaluations are performed under various inconsistent protocols. Correspondingly, these replicability issues make it difficult to understand how much benefit we can actually gain from these neural models. It then becomes clear that a fair and comprehensive performance comparison between traditional and neural models is needed.
Motivated by these issues, we perform a large-scale, systematic study to compare recent neural recommendation models against traditional ones in top-n recommendation from implicit data. We propose a set of evaluation strategies for measuring memorization performance, generalization performance, and subgroup-specific performance of recommendation models. We conduct extensive experiments with 13 popular recommendation models (including two neural models and 11 traditional ones as baselines) on nine commonly used datasets. Our experiments demonstrate that even with extensive hyper-parameter searches, neural models do not dominate traditional models in all aspects, e.g., they fare worse in terms of average HitRate. We further find that there are areas where neural models seem to outperform non-neural models, for example, in recommendation diversity and robustness between different subgroups of users and items. Our work illuminates the relative advantages and disadvantages of neural models in recommendation and is therefore an important step towards building better recommender systems.
Session search is a widely adopted technique in search engines that seeks to leverage the complete interaction history of a search session to better understand the information needs of users and provide more relevant ranking results. The vast majority of existing methods model a search session as a sequence of queries and previously clicked documents. However, if we simply represent a search session as a sequence we will lose the topological information in the original search session. It is non-trivial to model the intra-session interactions and complicated structural patterns among the previously issued queries, clicked documents, as well as the terms or entities that appeared in them. To solve this problem, in this paper, we propose a novel Session Search with Graph Classification Model (SSGC), which regards session search as a graph classification task on a heterogeneous graph that represents the search history in each session. To improve the performance of the graph classification, we design a specific pre-training strategy for our proposed GNN-based classification model. Extensive experiments on two public session search datasets demonstrate the effectiveness of our model in the session search task.
In this paper, we study the unsupervised plain graph alignment problem, which aims to find node correspondences across two graphs without any side information. The majority of previous works addressed UPGA based on structural information, which will inevitably lead to subgraph isomorphism issues. That is, unaligned nodes could take similar local structural information. To mitigate this issue, we present the Multi-order Matched Neighborhood Consistent (MMNC) which tries to match nodes by aligning the learned node embeddings with only a small number of pseudo alignment seeds. In particular, we extend matched neighborhood consistency (MNC) to vector space and further develop embedding-based MNC (EMNC). By minimizing the EMNC-based loss function, we can utilize the limited pseudo alignment seeds to approximate the orthogonal transformation matrix between two groups of node embeddings with high efficiency and accuracy. Through extensive experiments on public benchmarks, we show that the proposed methods achieve a good balance between alignment accuracy and speed over multiple datasets compared with existing methods.
SESSION: Session 20 – Personalized Models
Relation classification detects the semantic relation between two annotated entities from a piece of text, which is a useful tool for structurization of knowledge. Recently, federated learning has been introduced to train relation classification models in decentralized settings. Current methods strive for a strong server model by decoupling the model training at server from direct access to texts at clients while taking advantage of them. Nevertheless, they overlook the fact that clients have heterogeneous texts (i.e., texts with diversely skewed distribution of relations), which renders existing methods less practical. In this paper, we propose to investigate personalized federated relation classification, in which strong client models adapted to their own data are desired. To further meet the challenges brought by heterogeneous texts, we present a novel framework, namely pf-RC, with several optimized designs. It features a knowledge aggregation method that exploits a relation-wise weighting mechanism, and a feature augmentation method that leverages prototypes to adaptively enhance the representations of instances of long-tail relations. We experimentally validate the superiority of pf-RC against competing baselines in various settings, and the results suggest that the tailored techniques mitigate the challenges.
Cognitive diagnosis (CD) aims to reveal the proficiency of students on specific knowledge concepts and traits of test exercises (e.g., difficulty). It plays a critical role in intelligent education systems by supporting personalized learning guidance. However, recent developments in CD mostly concentrate on improving the accuracy of diagnostic results and often overlook the important and practical task: domain-level zero-shot cognitive diagnosis (DZCD). The primary challenge of DZCD is the deficiency of student behavior data in the target domain due to the absence of student-exercise interactions or unavailability of exercising records for training purposes. To tackle the cold-start issue, we propose a two-stage solution named TechCD (Transferable knowledgE Concept grapH embedding framework for Cognitive Diagnosis). The fundamental notion involves utilizing a pedagogical knowledge concept graph (KCG) as a mediator to connect disparate domains, allowing the transmission of student cognitive signals from established domains to the zero-shot cold-start domain. Specifically, a naive yet effective graph convolutional network (GCN) with the bottom-layer discarding operation is initially employed over the KCG to learn transferable student cognitive states and domain-specific exercise traits. Moreover, we give three implementations of the general TechCD framework following the typical cognitive diagnosis solutions. Finally, extensive experiments on real-world datasets not only prove that Tech can effectively perform zero-shot diagnosis, but also give some popular applications such as exercise recommendation.
Methods for making high-quality recommendations often rely on learning latent representations from interaction data. These methods, while performant, do not provide ready mechanisms for users to control the recommendation they receive. Our work tackles this problem by proposing LACE, a novel concept value bottleneck model for controllable text recommendations. LACE represents each user with a succinct set of human-readable concepts through retrieval given user-interacted documents and learns personalized representations of the concepts based on user documents. This concept based user profile is then leveraged to make recommendations. The design of our model affords control over the recommendations through a number of intuitive interactions with a transparent user profile. We first establish the quality of recommendations obtained from LACE in an offline evaluation on three recommendation tasks spanning six datasets in warm-start, cold-start, and zero-shot setups. Next, we validate the controllability of LACE under simulated user interactions. Finally, we implement LACE in an interactive controllable recommender system and conduct a user study to demonstrate that users are able to improve the quality of recommendations they receive through interactions with an editable user profile.
Ranking ensemble is a critical component in real recommender systems. When a user visits a platform, the system will prepare several item lists, each of which is generally from a single behavior objective recommendation model. As multiple behavior intents, e.g., both clicking and buying some specific item category, are commonly concurrent in a user visit, it is necessary to integrate multiple single-objective ranking lists into one. However, previous work on rank aggregation mainly focused on fusing homogeneous item lists with the same objective while ignoring ensemble of heterogeneous lists ranked with different objectives with various user intents.
In this paper, we treat a user’s possible behaviors and the potential interacting item categories as the user’s intent. And we aim to study how to fuse candidate item lists generated from different objectives aware of user intents. To address such a task, we propose an Intent-aware ranking Ensemble Learning (IntEL) model to fuse multiple single-objective item lists with various user intents, in which item-level personalized weights are learned. Furthermore, we theoretically prove the effectiveness of IntEL with point-wise, pair-wise, and list-wise loss functions via error-ambiguity decomposition. Experiments on two large-scale real-world datasets also show significant improvements of IntEL on multiple behavior objectives simultaneously compared to previous ranking ensemble models.
Personalized retrieval seeks to retrieve items relevant to a user event (e.g. a page visit or a query) that are adapted to the user’s personal preferences. For example, two users who happen to perform the same event such as visiting the same product page or asking the same query should receive potentially distinct recommendations adapted to their individual tastes. Personalization is seldom attempted over catalogs of millions of items since the cost of existing personalization routines scale linearly in the number of candidate items. For example, performing two-sided personalized retrieval (with both event and item embeddings personalized to the user) incurs prohibitive storage and compute costs. Instead, it is common to use non-personalized retrieval to obtain a small shortlist of items over which personalized re-ranking can be done quickly. Despite being scalable, this strategy risks losing items uniquely relevant to a user that fail to get shortlisted during non-personalized retrieval. This paper bridges this gap by developing the XPERT algorithm that identifies a form of two-sided personalization that can be scalably implemented over millions of items and hundreds of millions of users. Key to overcoming the computational challenges of personalized retrieval is a novel concept of morph operators that can be used with arbitrary encoder architectures, completely avoids the steep memory overheads of two-sided personalization, provides millisecond-time inference and offers multi-intent retrieval. On multiple public and proprietary datasets, XPERT offered upto 5% superior recall and AUC than state-of-the-art techniques. Code for XPERT is available at https://github.com/personalizedretrieval/xpert.
SESSION: Session 21 – Legal, Medicine & Health
Legal judgment prediction (LJP) is a significant task in legal intelligence, which aims to assist the judges and determine the judgment result based on the case’s fact description. The judgment result consists of law articles, charge, and prison term. The law articles serve as the basis for the charge and the prison term, which can be divided into two types, named as charge-related law article and term-related law article, respectively. Recently, many methods have been proposed and made tremendous progress in LJP. However, the existing methods only focus on the prediction of the charge-related law articles, ignoring the term-related law articles (e.g., laws about lenient treatment), which limits the performance in the prison term prediction. In this paper, following the actual legal process, we expand the law article prediction as a multi-label classification task that includes both the charge-related law articles and term-related law articles and propose a novel multi-law aware LJP (ML-LJP) method to improve the performance of LJP. Given the case’s fact description, firstly, the label (e.g., law article and charge) definitions in the Code of Law are used to transform the representation of the fact into several label-specific representations and make the prediction of the law articles and the charge. To distinguish the similar content of different label definitions, contrastive learning is conducted in the training. Then, a graph attention network (GAT) is applied to learn the interactions among the multiple law articles for the prediction of the prison term. Since numbers (e.g., amount of theft and weight of drugs) are important for LJP but often ignored by conventional encoders, we design a corresponding number representation method to locate and better represent these effective numbers. Extensive experiments on real-world dataset show that our method achieves the best results compared to the state-of-the-art models, especially in the task of prison term prediction where ML-LJP achieves a 10.07% relative improvement over the best baseline.
Legal case retrieval, which aims to find relevant cases for a query case, plays a core role in the intelligent legal system. Despite the success that pre-training has achieved in ad-hoc retrieval tasks, effective pre-training strategies for legal case retrieval remain to be explored. Compared with general documents, legal case documents are typically long text sequences with intrinsic logical structures. However, most existing language models have difficulty understanding the long-distance dependencies between different structures. Moreover, in contrast to the general retrieval, the relevance in the legal domain is sensitive to key legal elements. Even subtle differences in key legal elements can significantly affect the judgement of relevance. However, existing pre-trained language models designed for general purposes have not been equipped to handle legal elements.
To address these issues, in this paper, we propose SAILER, a new Structure-Aware pre-traIned language model for LEgal case Retrieval. It is highlighted in the following three aspects: (1) SAILER fully utilizes the structural information contained in legal case documents and pays more attention to key legal elements, similar to how legal experts browse legal case documents. (2) SAILER employs an asymmetric encoder-decoder architecture to integrate several different pre-training objectives. In this way, rich semantic information across tasks is encoded into dense vectors. (3) SAILER has powerful discriminative ability, even without any legal annotation data. It can distinguish legal cases with different charges accurately. Extensive experiments over publicly available legal benchmarks demonstrate that our approach can significantly outperform previous state-of-the-art methods in legal case retrieval.
Patents are legal documents that aim at protecting inventions on the one hand and at making technical knowledge circulate on the other. Their complex style — a mix of legal, technical, and extremely vague language — makes their content hard to access for humans and machines and poses substantial challenges to the information retrieval community. This paper proposes an approach to automatically simplify patent text through rephrasing. Since no in-domain parallel simplification data exist, we propose a method to automatically generate a large-scale silver standard for patent sentences. To obtain candidates, we use a general-domain paraphrasing system; however, the process is error-prone and difficult to control. Thus, we pair it with proper filters and construct a cleaner corpus that can successfully be used to train a simplification system. Human evaluation of the synthetic silver corpus shows that it is considered grammatical, adequate, and contains simple sentences.
Not Just Skipping: Understanding the Effect of Sponsored Content on Users’ Decision-Making in Online Health Search
Advertisements (ads) are an innate part of search engine business models. To promote sales, advertisers are willing to pay search engines to promote their content to a prominent position in the search result page (SERP). This raises concerns about the search engine manipulation effect (SEME): the opinions of users can be influenced by the way search results are presented
In this work, we investigate the connection between SEME and sponsored content in the health domain. We conduct a series of user studies in which participants need to evaluate the effectiveness of different non-prescription natural remedies for various medical conditions. We present participants SERPs with different intentionally created biases towards certain viewpoints, with or without sponsored content, and ask them to evaluate the effectiveness of the treatment solely based on the information presented to them. We investigate two types of sponsored content: 1). Direct marketing ads that directly market the product without expressing an opinion about its effectiveness; and 2). Indirect marketing ads that explicitly advocate the product’s effectiveness on the condition in the query. Our results reveal a significant difference between the influence on users from these two types of ads. Though users mostly skip direct marketing ads, they do sometimes tilt users’ decision-making. Indirect marketing ads affect both the users’ examination behavior and their perception of the treatment’s effectiveness. We further discover that the contrast between the indirect marketing ads and the viewpoint presented in the organic search results plays an important role in users’ decision-making. When the contrast is high, users exhibit a strong preference towards a negative viewpoint, and when the contrast is low or none, users exhibit a preference toward a more positive viewpoint.
Contrastive opinion extraction aims to extract a structured summary or key points organised as positive and negative viewpoints towards a common aspect or topic. Most recent works for unsupervised key point extraction is largely built on sentence clustering or opinion summarisation based on the popularity of opinions expressed in text. However, these methods tend to generate aspect clusters with incoherent sentences, conflicting viewpoints, redundant aspects. To address these problems, we propose a novel unsupervised Contrastive OpinioN Extraction model, called Cone, which learns disentangled latent aspect and sentiment representations based on pseudo aspect and sentiment labels by combining contrastive learning with iterative aspect/sentiment clustering refinement. Apart from being able to extract contrastive opinions, it is also able to quantify the relative popularity of aspects and their associated sentiment distributions. The model has been evaluated on both a hotel review dataset and a Twitter dataset about COVID vaccines. The results show that despite using no label supervision or aspect-denoted seed words, Cone outperforms a number of competitive baselines on contrastive opinion extraction. The results of Cone can be used to offer a better recommendation of products and services online.
SESSION: Session 22 – Collaborative Filtering
Graph collaborative filtering has achieved great success in capturing users’ preferences over items. Despite effectiveness, graph neural network (GNN)-based methods suffer from data sparsity in real scenarios. Recently, contrastive learning (CL) has been used to address the problem of data sparsity. However, most CL-based methods only leverage the original user-item interaction graph to construct the CL task, lacking the explicit exploitation of the higher-order information (i.e., user-user and item-item relationships). Even for the CL-based method that uses the higher-order information, the reception field of the higher-order information is fixed and regardless of the difference between nodes. In this paper, we propose a novel adaptive multi-view fusion contrastive learning framework, named AdaMCL, for graph collaborative filtering. To exploit the higher-order information more accurately, we propose an adaptive fusion strategy to fuse the embeddings learned from the user-item and user-user graphs. Moreover, we propose a multi-view fusion contrastive learning paradigm to construct effective CL tasks. Besides, to alleviate the noisy information caused by aggregating higher-order neighbors, we propose a layer-level CL task. Extensive experimental results reveal that AdaMCL is effective and outperforms existing collaborative filtering models significantly.
In dynamic interaction graphs, user-item interactions usually follow heterogeneous patterns, represented by different structural information, such as user-item co-occurrence, sequential information of user interactions and the transition probabilities of item pairs. However, the existing methods cannot simultaneously leverage all three structural information, resulting in suboptimal performance. To this end, we propose øurs, a triple structural information modeling method for accurate, explainable and interactive recommendation on dynamic interaction graphs. Specifically, øurs consists of 1) a dynamic ideal low-pass graph filter to dynamically mine co-occurrence information in user-item interactions, which is implemented by incremental singular value ecomposition (SVD); 2) a parameter-free attention module to capture sequential information of user interactions effectively and efficiently; and 3) an item transition matrix to store the transition probabilities of item pairs. Then, we fuse the predictions from the triple structural information sources to obtain the final recommendation results. By analyzing the relationship between the SVD-based and the recently emerging graph signal processing (GSP)-based collaborative filtering methods, we find that the essence of SVD is an ideal low-pass graph filter, so that the interest vector space in øurs can be extended to achieve explainable and interactive recommendation, making it possible for users to actively break through the information cocoons. Experiments on six public datasets demonstrated the effectiveness of øurs in accuracy, explainability and interactivity.
Collaborative filtering is one of the most fundamental topics for recommender systems. Various methods have been proposed for collaborative filtering, ranging from matrix factorization to graph convolutional methods. Being inspired by recent successes of graph filtering-based methods and score-based generative models (SGMs), we present a novel concept of blurring-sharpening process model (BSPM). SGMs and BSPMs share the same processing philosophy that new information can be discovered (e.g., new images are generated in the case of SGMs) while original information is first perturbed and then recovered to its original form. However, SGMs and our BSPMs deal with different types of information, and their optimal perturbation and recovery processes have fundamental discrepancies. Therefore, our BSPMs have different forms from SGMs. In addition, our concept not only theoretically subsumes many existing collaborative filtering models but also outperforms them in terms of Recall and NDCG in the three benchmark datasets, Gowalla, Yelp2018, and Amazon-book. In addition, the processing time of our method is comparable to other fast baselines. Our proposed concept has much potential in the future to be enhanced by designing better blurring (i.e., perturbation) and sharpening (i.e., recovery) processes than what we use in this paper. Our code is available at https://github.com/jeongwhanchoi/BSPM.
In collaborative filtering, distance metric learning has been applied to matrix factorization techniques with promising results. However, matrix factorization lacks the ability of capturing collaborative information, which has been remarked by recent works and improved by interpreting user interactions as signals. This paper aims to find out how metric learning connect to these signal-based models. By adopting a generalized distance metric, we discovered that in signal-based models, it is easier to estimate the residual of distances, which refers to the difference between the distances from a user to a target item and another item, rather than estimating the distances themselves. Further analysis also uncovers a link between the normalization strength of interaction signals and the novelty of recommendation, which has been overlooked by existing studies. Based on the above findings, we propose a novel model to learn a generalized distance user-item distance metric to capture user preference in interaction signals by modeling the residuals of distance. The proposed CoRML model is then further improved in training efficiency by a newly introduced approximated ranking weight. Extensive experiments conducted on 4 public datasets demonstrate the superior performance of CoRML compared to the state-of-the-art baselines in collaborative filtering, along with high efficiency and the ability of providing novelty-promoted recommendations, shedding new light on the study of metric learning-based recommender systems.
By treating users’ interactions as a user-item graph, graph learning models have been widely deployed in Collaborative Filtering~(CF) based recommendation. Recently, researchers have introduced Graph Contrastive Learning~(GCL) techniques into CF to alleviate the sparse supervision issue, which first constructs contrastive views by data augmentations and then provides self-supervised signals by maximizing the mutual information between contrastive views. Despite the effectiveness, we argue that current GCL-based recommendation models are still limited as current data augmentation techniques, either structure augmentation or feature augmentation. First, structure augmentation randomly dropout nodes or edges, which is easy to destroy the intrinsic nature of the user-item graph. Second, feature augmentation imposes the same scale noise augmentation on each node, which neglects the unique characteristics of nodes on the graph.
To tackle the above limitations, we propose a novel Variational Graph Generative-Contrastive Learning (VGCL) framework for recommendation. Specifically, we leverage variational graph reconstruction to estimate a Gaussian distribution of each node, then generate multiple contrastive views through multiple samplings from the estimated distributions, which builds a bridge between generative and contrastive learning. The generated contrastive views can well reconstruct the input graph without information distortion. Besides, the estimated variances are tailored to each node, which regulates the scale of contrastive loss for each node on optimization. Considering the similarity of the estimated distributions, we propose a cluster-aware twofold contrastive learning, a node-level to encourage consistency of a node’s contrastive views and a cluster-level to encourage consistency of nodes in a cluster. Finally, extensive experimental results on three public datasets clearly demonstrate the effectiveness of the proposed model.
Traditional approaches to improving users’ quality of experience (QoE) focus on minimizing the latency on the server side. Through an analysis of 15 million users, however, we find that for short-form video apps, user experience depends on both response latency and recommendation accuracy. This observation brings a dilemma to service providers since improving recommendation accuracy requires adopting complex strategies that demand heavy computation, which substantially increases response latency.
Our motivation is that users’ sensitivity to response latency and recommendation accuracy varies greatly. In other words, some users would accept a 20ms increase in latency to enjoy higher-quality videos, while others prioritize minimizing lag above all else. Inspired by this, we present Hydrus, a novel resource allocation system that delivers the best possible personalized QoE by making tradeoffs between response latency and recommendation accuracy. Specifically, we formulate the resource allocation problem as a utility maximization problem, and Hydrus is guaranteed to solve the problem within a few milliseconds.
We demonstrate the effectiveness of Hydrus through offline simulation and online experiments in Kuaishou, a massively popular video app with hundreds of millions of users worldwide. The results show that Hydrus can increase QoE by 35.6% with the same latency or reduce the latency by 10.1% with the same QoE. Furthermore, Hydrus can achieve 54.5% higher throughput without a decrease in QoE. In online A/B testing, Hydrus significantly improves click-through rate (CTR) and watch time; it can also reduce system resource costs without sacrificing QoE.
Recent studies show that graph neural networks (GNNs) are prevalent to model high-order relationships for collaborative filtering (CF). Towards this research line, graph contrastive learning (GCL) has exhibited powerful performance in addressing the supervision label shortage issue by learning augmented user and item representations. While many of them show their effectiveness, two key questions still remain unexplored: i) Most existing GCL-based CF models are still limited by ignoring the fact that user-item interaction behaviors are often driven by diverse latent intent factors (e.g., shopping for family party, preferred color or brand of products); ii) Their introduced non-adaptive augmentation techniques are vulnerable to noisy information, which raises concerns about the model’s robustness and the risk of incorporating misleading self-supervised signals. In light of these limitations, we propose a Disentangled Contrastive Collaborative Filtering framework (DCCF) to realize intent disentanglement with self-supervised augmentation in an adaptive fashion. With the learned disentangled representations with global context, our DCCF is able to not only distill finer-grained latent factors from the entangled self-supervision signals but also alleviate the augmentation-induced noise. Finally, the cross-view contrastive learning task is introduced to enable adaptive augmentation with our parameterized interaction mask generator. Experiments on various public datasets demonstrate the superiority of our method compared to existing solutions. Our model implementation is released at the link https://github.com/HKUDS/DCCF.
SESSION: Session 23 – Cold-start Recommendation and Users’ Preferences & Interests
Recommending cold items in recommendation systems is a longstanding challenge due to the inherent differences between warm items, which are recommended based on user behavior, and cold items, which are recommended based on content features. To tackle this, generative models generate synthetic embeddings from content features, while dropout models enhance the robustness of the recommendation system by randomly dropping behavioral embeddings during training. However, these models primarily focus on handling the recommendation of cold items, but do not effectively address the differences between warm and cold recommendations. As a result, generative models may over-recommend either warm or cold items, neglecting the other type, and dropout models may negatively impact warm item recommendations. To address this, we propose the Aligning Distillation (ALDI) framework, which leverages warm items as “teachers” to transfer their behavioral information to cold items, referred to as “students”. ALDI aligns the students with the teachers by comparing the differences in their recommendation characters, using tailored rating distribution aligning, ranking aligning, and identification aligning losses to narrow these differences. Furthermore, ALDI incorporates a teacher-qualifying weighting structure to prevent students from learning inaccurate information from unreliable teachers. Experiments on three datasets show that our approach outperforms state-of-the-art baselines in terms of overall, warm, and cold recommendation performance with three different recommendation backbones.
The cold-start problem is commonly encountered in recommender systems when delivering recommendations to users or items with limited interaction information and can seriously harm the performance of the system. To cope with this issue, meta-learning-based approaches have come to the rescue in recent years by enabling models to learn user preferences globally in the pre-training stage followed by local fine-tuning for a target user with only a few interactions. However, we argue that the user representation learned in this way may be inadequate to capture user preference well since solely utilizing his/her own interactions may be far from enough in cold-start scenarios. To tackle this problem, we propose a novel meta-learning method named M2EU to enrich the representations of cold-start users by incorporating the information from other similar users who are identified based on the similarity of both inherent attributes and historical interactions. In addition, we design an attention mechanism according to the variances of ratings in the aggregation of similar user embeddings. To further enhance the capability of user preference modeling, we devise different neural layers to generate user or item embeddings at the rating level and utilize the weight-sharing strategy to guarantee adequate parameters learning of neural layers in our meta-learning approach. In meta-training with mini-batching, we adopt an incremental learning scheme to learn a set of generalized parameters for all tasks. Experimental results on the public benchmark datasets demonstrate that M2EU outperforms state-of-the-art methods through extensive quantitative evaluations in various cold-start scenarios.
The issue of user cold-start poses a long-standing challenge to recommendation systems, due to the scarce interactions of new users. Recently, meta-learning based studies treat each cold-start user as a user-specific few-shot task and then derive meta-knowledge about fast model adaptation across training users. However, existing solutions mostly do not clearly distinguish the concept of new users and the concept of novel preferences, leading to over-reliance on meta-learning based adaptability to novel patterns. In addition, we also argue that the existing meta-training task construction inherently suffers from the memorization overfitting issue, which inevitably hinders meta-generalization to new users. In response to the aforementioned issues, we propose a preference learning decoupling framework, which is enhanced with meta-augmentation (PDMA), for user cold-start recommendation. To rescue the meta-learning from unnecessary adaptation to common patterns, our framework decouples preference learning for a cold-start user into two complementary aspects: common preference transfer, and novel preference adaptation. To handle the memorization overfitting issue, we further propose to augment meta-training users by injecting attribute-based noises, to achieve mutually-exclusive tasks. Extensive experiments on benchmark datasets demonstrate that our framework achieves superior performance improvements against state-of-the-art methods. We also show that our proposed framework is effective in alleviating memorization overfitting.
Exploring Scenarios of Uncertainty about the Users’ Preferences in Interactive Recommendation Systems
Interactive Recommender Systems have played a crucial role in distinct entertainment domains through a Contextual Bandit model. Despite the current advances, their personalisation level is still directly related to the information previously available about the users. However, there are at least two scenarios of uncertainty about the users’ preferences over their journey: (1) when the user joins for the first time and (2) when the system continually makes wrong recommendations because of prior misleading assumptions. In this work, we introduce concepts from the Active Learning theory to mitigate the impact of such scenarios. We modify three traditional bandits to recommend items with a higher potential to get more user information without decreasing the model’s accuracy when an uncertain scenario is observed. Our experiments show that the modified models outperform all baselines by increasing the cumulative reward in the long run. Moreover, a counterfactual evaluation validates that such improvements were not simply achieved due to the bias of offline datasets.
Review information has been demonstrated beneficial for the explainable recommendation. It can be treated as training corpora for generation-based methods or knowledge bases for extraction-based models. However, for generation-based methods, the sparsity of user-generated reviews and the high complexity of generative language models lead to a lack of personalization and adaptability. For extraction-based methods, focusing only on relevant attributes makes them invalid in situations where explicit attribute words are absent, limiting the potential of extraction-based models.
To this end, in this paper, we focus on the explicit and implicit analysis of review information simultaneously and propose novel a Topic-enhanced Graph Neural Networks (TGNN) to fully explore review information for better explainable recommendations. To be specific, we first use a pre-trained topic model to analyze reviews at the topic level, and design a sentence-enhanced topic graph to model user preference explicitly, where topics are intermediate nodes between users and items. Corresponding sentences serve as edge features. Thus, the requirement of explicit attribute words can be mitigated. Meanwhile, we leverage a review-enhanced rating graph to model user preference implicitly, where reviews are also considered as edge features for fine-grained user-item interaction modeling. Next, user and item representations from two graphs are used for final rating prediction and explanation extraction. Extensive experiments on three real-world datasets demonstrate the superiority of our proposed TGNN with both recommendation accuracy and explanation quality.
A bundle is a group of items that provides improved services to users and increased profits for sellers. However, locating the desired bundles that match the users’ tastes still challenges us, due to the sparsity issue. Despite the remarkable performance of existing approaches, we argue that they seldom consider the bundling strategy (i.e., how the items within a bundle are associated with each other) in the bundle recommendation, resulting in the suboptimal user and bundle representations for their interaction prediction. Therefore, we propose to model the strategy-aware user and bundle representations for the bundle recommendation.
Towards this end, we develop a new model for bundle recommendation, termed Bundle Graph Transformer (BundleGT), which consists of the token embedding layer, hierarchical graph transformer (HGT) layer, and prediction layer. Specifically, in the token embedding layer, we take the items within bundles as tokens and represent them with items’ id embedding learned from user-item interactions. Having the input tokens, the HGT layer can simultaneously model the strategy-aware bundle and user representations. Therein, we encode the prior knowledge of bundling strategy from the well-designed bundles and incorporate it with tokens’ embeddings to model the bundling strategy and learn the strategy-aware bundle representations. Meanwhile, upon the correlation between bundles consumed by the same user, we further learn the user preference on bundling strategy. Jointly considering it with the user preference on the item content, we can learn the strategy-aware user representation for user-bundle interaction prediction.
Conducting extensive experiments on Youshu, ifashion, and Netease datasets, we demonstrate that our proposed model outperforms the state-of-the-art baselines (e.g., BundelNet  Net, BGCN  BGCN, and CrossCBR ), justifying the effectiveness of our proposed model. Moreover, in HGT layer, our devised light self-attention block improves not only the accuracy performance but efficiency of BundleGT. Our code is publicly available at: https://github.com/Xiaohao-Liu/BundleGT.
SESSION: Session 24 – Cross-modal, Cross-lingual, Multi-modal
In this work, we explore a Multilingual Information Retrieval (MLIR) task, where the collection includes documents in multiple languages. We demonstrate that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance. This is due to the heterogeneous and imbalanced nature of multilingual collections — some languages are better represented in the collection and some benefit from large-scale training data. To address this issue, we present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly “translates” the representation of documents in different languages into the same embedding space. To address the challenges of data scarcity and imbalance, we introduce a knowledge distillation strategy. The teacher model is trained on rich English retrieval data, and by leveraging bi-text data, our distillation framework transfers its retrieval knowledge to the multilingual document encoder. Therefore, our approach does not require any multilingual retrieval training data. Extensive experiments on three MLIR datasets with a total of 15 languages demonstrate that KD-SPD significantly outperforms competitive baselines in all cases. We conduct extensive analyses to show that our method has less language bias and better zero-shot transfer ability towards new languages.
Learning sparse representations using pretrained language models enhances the monolingual ranking effectiveness. Such representations are sparse vectors in the vocabulary of a language model projected from document terms. Extending such approaches to Cross-Language Information Retrieval (CLIR) using multilingual pretrained language models poses two challenges. First, the larger vocabularies of multilingual models affect both training and inference efficiency. Second, the representations of terms from different languages with similar meanings might not be sufficiently similar. To address these issues, we propose a learned sparse representation model, BLADE, combining vocabulary pruning with intermediate pre-training based on cross-language supervision. Our experiments reveal BLADE significantly reduces indexing time compared to its monolingual counterpart, SPLADE, on machine-translated documents, and it generates rankings with strengths complementary to those of other efficient CLIR methods.
Cross-lingual Named Entity Recognition (NER) aims to address the challenge of data scarcity in low-resource languages by leveraging knowledge from high-resource languages. Most current work relies on general multilingual language models to represent text, and then uses classic combined tagging (e.g., B-ORG) to annotate entities; However, this approach neglects the lack of cross-lingual alignment of entity representations in language models, and also ignores the fact that entity spans and types have varying levels of labeling difficulty in terms of transferability. To address these challenges, we propose a novel framework, referred to as DLBri, which addresses the issues of representation and labeling simultaneously. Specifically, the proposed framework utilizes progressive contrastive learning with source-to-target oriented sentence pairs to pre-finetune the language model, resulting in improved cross-lingual entity-aware representations. Additionally, a decomposition-then-combination procedure is proposed, which separately transfers entity span and type, and then combines their information, to reduce the difficulty of cross-lingual entity labeling. Extensive experiments on 13 diverse language pairs confirm the effectiveness of DLBri.
Image-text retrieval, as a fundamental and important branch of information retrieval, has attracted extensive research attentions. The main challenge of this task is cross-modal semantic understanding and matching. Some recent works focus more on fine-grained cross-modal semantic matching. With the prevalence of large scale multimodal pretraining models, several state-of-the-art models (e.g. X-VLM) have achieved near-perfect performance on widely-used image-text retrieval benchmarks, i.e. MSCOCO-Test-5K and Flickr30K-Test-1K. In this paper, we review the two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching. The reason is that a large amount of images and texts in the benchmarks are coarse-grained. Based on the observation, we renovate the coarse-grained images and texts in the old benchmarks and establish the improved benchmarks called MSCOCO-FG and Flickr30K-FG. Specifically, on the image side, we enlarge the original image pool by adopting more similar images. On the text side, we propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort. Furthermore, we evaluate representative image-text retrieval models on our new benchmarks to demonstrate the effectiveness of our method. We also analyze the capability of models on fine-grained semantic comprehension through extensive experiments. The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding, especially in distinguishing attributes of close objects in images. Our code and improved benchmark datasets are publicly available1 which we hope will inspire further in-depth research on cross-modal retrieval.
Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities. Prior work usually focuses on the pairwise relations (i.e., whether a data sample matches another) but ignores the higher-order neighbor relations (i.e., a matching structure among multiple data samples). Re-ranking, a popular post-processing practice, has revealed the superiority of capturing neighbor relations in single-modality retrieval tasks. However, it is ineffective to directly extend existing re-ranking algorithms to image-text retrieval. In this paper, we analyze the reason from four perspectives, i.e., generalization, flexibility, sparsity, and asymmetry, and propose a novel learnable pillar-based re-ranking paradigm. Concretely, we first select top-ranked intra- and intermodal neighbors as pillars, and then reconstruct data samples with the neighbor relations between them and the pillars. In this way, each sample can be mapped into a multimodal pillar space only using similarities, ensuring generalization. After that, we design a neighbor-aware graph reasoning module to flexibly exploit the relations and excavate the sparse positive items within a neighborhood. We also present a structure alignment constraint to promote crossmodal collaboration and align the asymmetric modalities. On top of various base backbones, we carry out extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, demonstrating the effectiveness, superiority, generalization, and transferability of our proposed re-ranking paradigm.
In addition to relevance, diversity is an important yet less studied performance metric of cross-modal image retrieval systems, which is critical to user experience. Existing solutions for diversity-aware image retrieval either explicitly post-process the raw retrieval results from standard retrieval systems or try to learn multi-vector representations of images to represent their diverse semantics. However, neither of them is good enough to balance relevance and diversity. On the one hand, standard retrieval systems are usually biased to common semantics and seldom exploit diversity-aware regularization in training, which makes it difficult to promote diversity by post-processing. On the other hand, multi-vector representation methods are not guaranteed to learn robust multiple projections. As a result, irrelevant images and images of rare or unique semantics may be projected inappropriately, which degrades the relevance and diversity of the results generated by some typical algorithms like top-k. To cope with these problems, this paper presents a new method called CoLT that tries to generate much more representative and robust representations for accurately classifying images. Specifically, CoLT first extracts semantics-aware image features by enhancing the preliminary representations of an existing one-to-one cross-modal system with semantics-aware contrastive learning. Then, a transformer-based token classifier is developed to subsume all the features into their corresponding categories. Finally, a post-processing algorithm is designed to retrieve images from each category to form the final retrieval result. Extensive experiments on two real-world datasets Div400 and Div150Cred show that CoLT can effectively boost diversity, and outperforms the existing methods as a whole (with a higher F1 score).
From Region to Patch: Attribute-Aware Foreground-Background Contrastive Learning for Fine-Grained Fashion Retrieval
Attribute-specific fashion retrieval (ASFR) is a challenging information retrieval task, which has attracted increasing attention in recent years. Different from traditional fashion retrieval which mainly focuses on optimizing holistic similarity, the ASFR task concentrates on attribute-specific similarity, resulting in more fine-grained and interpretable retrieval results. As the attribute-specific similarity typically corresponds to the specific subtle regions of images, we propose a Region-to-Patch Framework (RPF) that consists of a region-aware branch and a patch-aware branch to extract fine-grained attribute-related visual features for precise retrieval in a coarse-to-fine manner. In particular, the region-aware branch is first to be utilized to locate the potential regions related to the semantic of the given attribute. Then, considering that the located region is coarse and still contains the background visual contents, the patch-aware branch is proposed to capture patch-wise attribute-related details from the previous amplified region. Such a hybrid architecture strikes a proper balance between region localization and feature extraction. Besides, different from previous works that solely focus on discriminating the attribute-relevant foreground visual features, we argue that the attribute-irrelevant background features are also crucial for distinguishing the detailed visual contexts in a contrastive manner. Therefore, a novel E-InfoNCE loss based on the foreground and background representations is further proposed to improve the discrimination of attribute-specific representation. Extensive experiments on three datasets demonstrate the effectiveness of our proposed framework, and also show a decent generalization of our RPF on out-of-domain fashion images. Our source code is available at https://github.com/HuiGuanLab/RPF.
SESSION: Session 25 – E-commerce & Other Applications
Next-basket recommendation (NBR) is a type of recommendation that aims to recommend a set of items to users according to their historical basket sequences. Existing NBR methods suffer from two limitations: (1) overlooking low-level item correlations, which results in coarse-grained item representation; and (2) failing to consider spurious interests in repeated behaviors, leading to suboptimal user interest learning. To address these limitations, we propose a novel solution named Multi-view Multi-aspect Neural Recommendation (MMNR) for NBR, which first normalizes the interactions from both the user-side and item-side, respectively, aiming to remove the spurious interests, and utilizes them as weights for items from different views to construct differentiated representations for each interaction item, enabling comprehensive user interest learning. Then, to capture low-level item correlations, MMNR models different aspects of items to obtain disentangled representations of items, thereby fully capturing multiple user interests. Extensive experiments on real-world datasets demonstrate the effectiveness of MMNR, showing that it consistently outperforms several state-of-the-art NBR methods.
Online shops such as Amazon, eBay, and Etsy continue to expand their presence in multiple countries, creating new resource-scarce marketplaces with thousands of items. We consider a marketplace to be resource-scarce when only limited user-generated data is available about the products (e.g., ratings, reviews, and product-related questions). In such a marketplace, an information retrieval system is less likely to help users find answers to their questions about the products. As a result, questions posted online may go unanswered for extended periods. This study investigates the impact of using available data in a resource-rich marketplace to answer new questions in a resource-scarce marketplace, a new problem we call cross-market question answering. To study this problem’s potential impact, we collect and annotate a new dataset, XMarket-QA, from Amazon’s UK (resource-scarce) and US (resource-rich) local marketplaces. We conduct a data analysis to understand the scope of the cross-market question-answering task. This analysis shows a temporal gap of almost one year between the first question answered in the UK marketplace and the US marketplace. Also, it shows that the first question about a product is posted in the UK marketplace only when 28 questions, on average, have already been answered about the same product in the US marketplace. Human annotations demonstrate that, on average, 65% of the questions in the UK marketplace can be answered within the US marketplace, supporting the concept of cross-market question answering. Inspired by these findings, we develop a new method, CMJim, which utilizes product similarities across marketplaces in the training phase for retrieving answers from the resource-rich marketplace that can be used to answer a question in the resource-scarce marketplace. Our evaluations show CMJim’s significant improvement compared to competitive baselines.
Next Basket Recommendation (NBR) that recommends a basket of items to users has become a promising promotion artifice for online businesses. The key challenge of NBR is rooted in the complicated relations of items that are dependent on one another in a same basket with users’ diverse purchasing intentions, which goes far beyond the pairwise item relations in traditional recommendation tasks, and yet has not been well addressed by existing NBR methods that mostly model the inter-basket item relations only. To that end, in this paper, we construct a hypergraph from basket-wise purchasing records and probe the inter-basket and intra-basket item relations behind the hyperedges. In particular, we combine the strength of HyperGraph Neural Network with disentangled representation learning to derive the intent-aware representations of hyperedges for characterizing the nuances of user purchasing patterns. Moreover, considering the information loss in traditional item-wise optimization, we propose a novel basket-wise optimization scheme via an adversarial network to generate high-quality negative baskets. Extensive experiments conducted on four different data sets demonstrate the superior performances over the state-of-the-art NBR methods. Notably, our method is shown to strike a good balance in recommending both repeated and explorative items as a basket.
Modern online service providers such as online shopping platforms often provide both search and recommendation (S&R) services to meet different user needs. Rarely has there been any effective means of incorporating user behavior data from both S&R services. Most existing approaches either simply treat S&R behaviors separately, or jointly optimize them by aggregating data from both services, ignoring the fact that user intents in S&R can be distinctively different. In our paper, we propose a Search-Enhanced framework for the Sequential Recommendation (SESRec) that leverages users’ search interests for recommendation, by disentangling similar and dissimilar representations within S&R behaviors. Specifically, SESRec first aligns query and item embeddings based on users’ query-item interactions for the computations of their similarities. Two transformer encoders are used to learn the contextual representations of S&R behaviors independently. Then a contrastive learning task is designed to supervise the disentanglement of similar and dissimilar representations from behavior sequences of S&R. Finally, we extract user interests by the attention mechanism from three perspectives, i.e., the contextual representations, the two separated behaviors containing similar and dissimilar interests. Extensive experiments on both industrial and public datasets demonstrate that SESRec consistently outperforms state-of-the-art models. Empirical studies further validate that SESRec successfully disentangle similar and dissimilar user interests from their S&R behaviors.
Unsupervised readability assessment aims to evaluate the reading difficulty of text without any manually-labeled data for model training. This is a challenging task because the absence of labeled data makes it difficult for the model to understand what readability is. In this paper, we propose a novel framework to Learn a neural model from Weak Readability Signals (LWRS). Instead of relying on labeled data, LWRS utilizes a set of heuristic signals that specialize in describing text readability from different aspects to guide the model in outputting readability scores for ranking. Specifically, to effectively use multiple heuristic weak signals for model training, we build a multi-signal learning model that ranks the unlabeled texts from multiple readability-related aspects based on intra- and inter-signal learning. We also adopt the pairwise ranking paradigm to reduce the cascade coupling among partial-order pairs. Furthermore, we propose identifying the most representative signal based on the batch-level consensus distribution of all signals. This strategy helps identify the predicted signal that is most correlated with readability in the absence of ground-truth labels. We conduct experiments on three public readability assessment datasets. The experimental results demonstrate that our LWRS outperforms each heuristic signal and their combinations significantly, and can even perform comparably with some supervised methods. Additionally, our LWRS trained on one dataset can be effectively transferred to other datasets, including those in other languages, which indicates its good generalization and potential for wide application.
Many texts, especially in chemistry and biology, describe complex processes. We focus on texts that describe a chemical reaction process and questions that ask about the process’s outcome under different environmental conditions. To answer questions about such processes, one needs to understand the interactions between the different entities involved in the process and simulate their state transitions during the process execution under other conditions. We hypothesize that generating code and executing it to simulate the process will allow answering such questions. We, therefore, define a domain-specific language (DSL) to represent processes. We contribute to the community a unique dataset curated by chemists and annotated by computer scientists. The dataset is composed of process texts, simulation questions, and their corresponding computer codes represented by the DSL. We propose a neural program synthesis approach based on reinforcement learning with a novel state-transition semantic reward. The novel reward is based on the run-time semantic similarity between the predicted code and the reference code. This allows simulating complex process transitions and thus answering simulation questions. Our approach yields a significant boost in accuracy for simulation questions: we achieved 88% accuracy as opposed to 83% accuracy of the state-of-the-art neural program synthesis approaches and 54% accuracy of state-of-the-art end-to-end text-based approaches.
ErrorCLR: Semantic Error Classification, Localization and Repair for Introductory Programming Assignments
Programming education at scale increasingly relies on automated feedback to help students learn to program. An important form of feedback is to point out semantic errors in student programs and provide hints for program repair. Such automated feedback depends essentially on solving the tasks of classification, localization and repair of semantic errors. Although there are datasets for the tasks, we observe that they do not have the annotations supporting all three tasks. As such, existing approaches for semantic error feedback treat error classification, localization and repair as independent tasks, resulting in sub-optimal performance on each task. Moreover, existing datasets either contain few programming assignments or have few programs for each assignment. Therefore, existing approaches often leverage rule-based methods and evaluate them with a small number of programming assignments. To tackle the problems, we first describe the creation of a new dataset COJ2022 that contains 5,914 C programs with semantic errors submitted to 498 different assignments in an introductory programming course, where each program is annotated with the error types and locations and is coupled with the repaired program submitted by the same student. We show the advantages of COJ2022 over existing datasets on various aspects. Second, we treat semantic error classification, localization and repair as dependent tasks, and propose a novel two-stage method ErrorCLR to solve them. Specifically, in the first stage we train a model based on graph matching networks to jointly classify and localize potential semantic errors in student programs, and in the second stage we mask error spans in buggy programs using information of error types and locations and train a CodeT5 model to predict correct spans. The predicted spans replace the error spans to form repaired programs. Experimental results show that ErrorCLR remarkably outperforms the comparative methods for all three tasks on COJ2022 and other public datasets. We also conduct a case study to visualize and interpret what is learned by the graph matching network in ErrorCLR. We have released the source code and COJ2022 at https://github.com/DaSESmartEdu/ErrorCLR.
SESSION: Session 26 – Query Performance and CTR Prediction
Thanks to recent advances in IR and NLP, the way users interact with search engines is evolving rapidly, with multi-turn conversations replacing traditional one-shot textual queries. Given its interactive nature, Conversational Search (CS) is one of the scenarios that can benefit the most from Query Performance Prediction (QPP) techniques. QPP for the CS domain is a relatively new field and lacks proper framing. In this study, we address this gap by proposing a framework for the application of QPP in the CS domain and use it to evaluate the performance of predictors. We characterize what it means to predict the performance in the CS scenario, where information needs are not independent queries but a series of closely related utterances. We identify three main ways to use QPP models in the CS domain: as a diagnostic tool, as a way to adjust the system’s behaviour during a conversation, or as a way to predict the system’s performance on the next utterance. Due to the lack of established evaluation procedures for QPP in the CS domain, we propose a protocol to evaluate QPPs for each of the use cases. Additionally, we introduce a set of spatial-based QPP models designed to work the best in the conversational search domain, where dense neural retrieval models are the most common approaches and query cutoffs are typically small. We show how the proposed QPP approaches improve significantly the predictive performance over the state-of-the-art in different scenarios and collections.
DMBIN: A Dual Multi-behavior Interest Network for Click-Through Rate Prediction via Contrastive Learning
Click-through rate (CTR) prediction plays a critical role in various online applications, aiming to estimate the user’s click probability. User interest modeling from various interactive behaviors(e.g., click, add-to-cart, order) is becoming a mainstream approach to CTR prediction. We argue that the various user behaviors contain two important intrinsic characteristics: 1) The discrepancy in various behaviors reveals different aspects of user’s behavior-specific interests. For example, one may click out of need but pay more attention to the rating when purchasing. 2) The consistency of various behaviors contains user’s behavior-invariant interest. For example, the user prefers interacted items rather than other items. Therefore, it is necessary to disentangle the discrepancy and consistency signals from the massive behavior information. Unfortunately, previous methods have yet to study this phenomenon well, which limits the recommendation performance.
To tackle this challenge, we propose a novel Dual Multi-Behavior Interest Network (DMBIN for short) to disentangle behavior-specific and behavioral-invariant interests from various behaviors for a better recommendation. Specifically, DMBIN formalizethe discrepancy and consistency characteristics among various behaviors. Extensive experiments and empirical analysis on two real-world datasets demonstrate that DMBIN significantly outperforms the state-of-the-art methods. Moreover, DMBIN is also deployed in the online sponsored search advertising system in Meituan and achieves 2.11% and 2.76% improvement on CTR and CPM, respectively.s the dismantlement task as two contrastive learning tasks of multi-behavior interests extracted through the Multi-behavior Interest Module: Multi-behavior Interest Contrast(MIC) task and Multi-behavior Interest Alignment(MIA) task. These two tasks focus on extracting
Learning effective high-order feature interactions is very crucial in the CTR prediction task. However, it is very time-consuming to calculate high-order feature interactions with massive features in online e-commerce platforms. Most existing methods manually design a maximal order and further filter out the useless interactions from them. Although they reduce the high computational costs caused by the exponential growth of high-order feature combinations, they still suffer from the degradation of model capability due to the suboptimal learning of the restricted feature orders. The solution to maintain the model capability and meanwhile keep it efficient is a technical challenge, which has not been adequately addressed. To address this issue, we propose an adaptive feature interaction learning model, named as EulerNet, in which the feature interactions are learned in a complex vector space by conducting space mapping according to Euler’s formula. EulerNet converts the exponential powers of feature interactions into simple linear combinations of the modulus and phase of the complex features, making it possible to adaptively learn the high-order feature interactions in an efficient way. Furthermore, EulerNet incorporates the implicit and explicit feature interactions into a unified architecture, which achieves the mutual enhancement and largely boosts the model capabilities. Such a network can be fully learned from data, with no need of pre-designed form or order for feature interactions. Extensive experiments conducted on three public datasets have demonstrated the effectiveness and efficiency of our approach. Our code is available at: https://github.com/RUCAIBox/EulerNet.
Click-Through Rate (CTR) prediction plays a core role in recommender systems, serving as the final-stage filter to rank items for a user. The key to addressing the CTR task is learning feature interactions that are useful for prediction, which is typically achieved by fitting historical click data with the Empirical Risk Minimization (ERM) paradigm. Representative methods include Factorization Machines and Deep Interest Network, which have achieved wide success in industrial applications. However, such a manner inevitably learns unstable feature interactions, i.e., the ones that exhibit strong correlations in historical data but generalize poorly for future serving.
In this work, we reformulate the CTR task — instead of pursuing ERM on historical data, we split the historical data chronologically into several periods (a.k.a, environments), aiming to learn feature interactions that are stable across periods. Such feature interactions are supposed to generalize better to predict future behavior data. Nevertheless, a technical challenge is that existing invariant learning solutions like Invariant Risk Minimization are not applicable, since the click data entangles both environment-invariant and environment-specific correlations. To address this dilemma, we propose Disentangled Invariant Learning (DIL) which disentangles feature embeddings to capture the two types of correlations separately. To improve the modeling efficiency, we further design LightDIL which performs the disentanglement at the higher level of the feature field. Extensive experiments demonstrate the effectiveness of DIL in learning stable feature interactions for CTR.
Popularity detection of news articles is critical for making relevant recommendations for users and drive user engagement for maximum business value. Among several well-known metrics such as likes, shares, comments, Click-Through-Rate (CTR) has evolved as a default metric of popularity. However, CTR is highly influenced by the probability of news articles getting an impression, which in turn depends on the recommendation algorithm. Furthermore, it does not consider the age of the news articles, which are highly perishable and also misses out on human contextual behavioral preferences towards news. Here, we use the MIND dataset, open sourced by Microsoft to investigate the existing metrics of popularity and propose six new metrics. Our aim is to create awareness about the different perspectives of measuring popularity while discussing the advantages and disadvantages of the proposed metrics with respect to the human click behavior. We evaluated the predictability of the proposed metrics in comparison to CTR prediction. We further evaluated the utility of the proposed metrics through different test cases. Our results indicate that by using appropriate popularity metrics, we can reduce the initial news corpus (item set) by 50% and still could achieve 99% of the total clicks as compared to unfiltered news corpus based recommender systems. Similarly, our results show that we can reduce the effective number of articles recommended per impression that could improve user experience with the news platforms. The metrics proposed in this paper can be useful in other contexts, especially in recommenders with perishable items e.g. video reels or blogs.
The delayed feedback is becoming one of the main obstacles in online advertising due to the pervasive deployment of the cost-per-conversion display strategy requesting a real-time conversion rate (CVR) prediction. It makes the observed data contain a large number of fake negatives that temporarily have no feedback but will convert later. Training on such biased data distribution would severely harm the performance of models. Prevailing approaches wait for a set period of time to see if samples convert before training on them, but solutions to guaranteeing data freshness remain under-explored by current research. In this work, we propose Delayed Feed-back modeling via neural Satellite Networks (DFSN for short) for online CVR prediction. It tackles the issue of data freshness to permit adaptive waiting windows. We first assign a long waiting window for our main model to cover most of conversions and greatly reduce fake negatives. Meanwhile, two kinds of satellite models are devised to learn from the latest data, and online transfer learning techniques are utilized to sufficiently exploit their knowledge. With information from satellites, our main model can deal with the issue of data freshness, achieving better performance than previous methods. Extensive experiments on two real-world advertising datasets demonstrate the superiority of our model.
SESSION: Session 27 – Summarisation & Text Generation
Automatic summarization plays an important role in the exponential document growth on the Web. On content websites such as CNN.com and WikiHow.com, there often exist various kinds of side information along with the main document for attention attraction and easier understanding, such as videos, images, and queries. Such information can be used for better summarization, as they often explicitly or implicitly mention the essence of the article. However, most of the existing side-aware summarization methods are designed to incorporate either single-modal or multi-modal side information, and cannot effectively adapt to each other. In this paper, we propose a general summarization framework, which can flexibly incorporate various modalities of side information. The main challenges in designing a flexible summarization model with side information include: (1) the side information can be in textual or visualformat, and the model needs to align and unify it with the document into the same semantic space, (2) the side inputs can contain information from variousaspects, and the model should recognize the aspects useful for summarization. To address these two challenges, we first propose a unified topic encoder, which jointly discovers latent topics from the document and various kinds of side information. The learned topics flexibly bridge and guide the information flow between multiple inputs in a graph encoder through a topic-aware interaction. We secondly propose a triplet contrastive learning mechanism to align the single-modal or multi-modal information into a unified semantic space, where thesummary quality is enhanced by better understanding thedocument andside information. Results show that our model significantly surpasses strong baselines on three public single-modal or multi-modal benchmark summarization datasets.
Systematic reviews are comprehensive literature reviews for a highly focused research question. These reviews are considered the highest form of evidence in medicine. Complex Boolean queries are developed as part of the systematic review creation process to retrieve literature, as they permit reproducibility and understandability. However, it is difficult and time-consuming to develop high-quality Boolean queries, often requiring the expertise of expert searchers like librarians. Recent advances in transformer-based generative models have shown their ability to effectively follow user instructions and generate answers based on these instructions. In this paper, we investigate ChatGPT as a means for automatically formulating and refining complex Boolean queries for systematic review literature search. Overall, our research finds that ChatGPT has the potential to generate effective Boolean queries. The ability of ChatGPT to follow complex instructions and generate highly precise queries makes it a tool of potential value for researchers conducting systematic reviews, particularly for rapid reviews where time is a constraint and where one can trade off higher precision for lower recall. We also identify several caveats in using ChatGPT for this task, highlighting that this technology needs further validation before it is suitable for widespread uptake.
Retrieval-augmented generation models offer many benefits over standalone language models: besides a textual answer to a given query they provide provenance items retrieved from an updateable knowledge base. However, they are also more complex systems and need to handle long inputs. In this work, we introduce FiD-Light to strongly increase the efficiency of the state-of-the-art retrieval-augmented FiD model, while maintaining the same level of effectiveness. Our FiD-Light model constrains the information flow from the encoder (which encodes passages separately) to the decoder (using concatenated encoded representations). Furthermore, we adapt FiD-Light with re-ranking capabilities through textual source pointers, to improve the top-ranked provenance precision. Our experiments on a diverse set of seven knowledge intensive tasks (KILT) show FiD-Light consistently improves the Pareto frontier between query latency and effectiveness. FiD-Light with source pointing sets substantial new state-of-the-art results on six KILT tasks for combined text generation and provenance retrieval evaluation, while maintaining high efficiency.
Knowledge-intensive language tasks (KILTs) benefit from retrieving high-quality relevant contexts from large external knowledge corpora. Learning task-specific retrievers that return relevant contexts at an appropriate level of semantic granularity, such as a document retriever, passage retriever, sentence retriever, and entity retriever, may help to achieve better performance on the end-to-end task. But a task-specific retriever usually has poor generalization ability to new domains and tasks, and it may be costly to deploy a variety of specialised retrievers in practice.
We propose a unified generative retriever (UGR) that combines task-specific effectiveness with robust performance over different retrieval tasks in KILTs. To achieve this goal, we make two major contributions: (i) To unify different retrieval tasks into a single generative form, we introduce an n-gram-based identifier for relevant contexts at different levels of granularity in KILTs. And (ii) to address different retrieval tasks with a single model, we employ a prompt learning strategy and investigate three methods to design prompt tokens for each task. In this way, the proposed UGR model can not only share common knowledge across tasks for better generalization, but also perform different retrieval tasks effectively by distinguishing task-specific characteristics. We train UGR on a heterogeneous set of retrieval corpora with well-designed prompts in a supervised and multi-task fashion. Experimental results on the KILT benchmark demonstrate the effectiveness of UGR on in-domain datasets, out-of-domain datasets, and unseen tasks.
Test2SQL, a natural language interface to database querying, has seen considerable improvement, in part due to advances in deep learning. However, despite recent improvement, existing Text2SQL proposals allow only input in the form of complete questions. This leaves behind users who struggle to formulate complete questions, e.g., because they lack database expertise or are unfamiliar with the underlying database schema. To address this shortcoming, we study the novel problem of Text2SQL Auto-Completion (TSAC) that extends Text2SQL to also take partial or incomplete questions as input. Specifically, the TSAC problem is to predict the complete, executable SQL query. To solve the problem, we propose a novel Relation-aware Historical Bridging Network (RHB-Net) that consists of a relation-aware union encoder and an extraction-generation sensitive decoder. RHB-Net models relations between questions and database schemas and predicts the ambiguous intents expressed in partial queries. We also propose two optimization strategies: historical query bridging that fuses historical database queries, and a dynamic context construction that prevents repeated generation of the same SQL elements. Extensive experiments with real-world data offer evidence that RHB-Net is capable of outperforming baseline algorithms.
SESSION: Session 28 – Cross-domain Recommendation
M2GNN: Metapath and Multi-interest Aggregated Graph Neural Network for Tag-based Cross-domain Recommendation
Cross-domain recommendation (CDR) is an effective way to alleviate the data sparsity problem. Content-based CDR is one of the most promising branches since most kinds of products can be described by a piece of text, especially when cold-start users or items have few interactions. However, two vital issues are still under-explored: (1) From the content modeling perspective, sufficient long-text descriptions are usually scarce in a real recommender system, more often the light-weight textual features, such as a few keywords or tags, are more accessible, which is improperly modeled by existing methods. (2) From the CDR perspective, not all inter-domain interests are helpful to infer intra-domain interests. Caused by domain-specific features, there are part of signals benefiting for recommendation in the source domain but harmful for that in the target domain. Therefore, how to distill useful interests is crucial. To tackle the above two problems, we propose a metapath and multi-interest aggregated graph neural network (M2GNN). Specifically, to model the tag-based contents, we construct a heterogeneous information network to hold the semantic relatedness between users, items, and tags in all domains. The metapath schema is predefined according to domain-specific knowledge, with one metapath for one domain. User representations are learned by GNN with a hierarchical aggregation framework, where the intra-metapath aggregation firstly filters out trivial tags and the inter-metapath aggregation further filters out useless interests. Offline experiments and online A/B tests demonstrate that M2GNN achieves significant improvements over the state-of-the-art methods and current industrial recommender system in Dianping, respectively. Further analysis shows that M2GNN offers an interpretable recommendation.
Cross-Domain Recommendation (CDR) is a widely used approach for leveraging information from domains with rich data to assist domains with insufficient data. A key challenge of CDR research is the effective and efficient transfer of helpful information from source domain to target domain. Currently, most existing CDR methods focus on extracting implicit information from the source domain to enhance the target domain. However, the hidden structure of the extracted implicit information is highly dependent on the specific CDR model, and is therefore not easily reusable or transferable. Additionally, the extracted implicit information only appears within the intermediate substructure of specific CDRs during training and is thus not easily retained for more use. In light of these challenges, this paper proposes AutoTransfer, with an Instance Transfer Policy Network, to selectively transfers instances from source domain to target domain for improved recommendations. Specifically, AutoTransfer acts as an agent that adaptively selects a subset of informative and transferable instances from the source domain. Notably, the selected subset possesses extraordinary re-utilization property that can be saved for improving model training of various future RS models in target domain. Experimental results on two public CDR benchmark datasets demonstrate that the proposed method outperforms state-of-the-art CDR baselines and classic Single-Domain Recommendation (SDR) approaches. The implementation code is available for easy reproduction.
Cross-Domain Recommendation (CDR) is capable of incorporating auxiliary information from multiple domains to advance recommendation performance. Conventional CDR methods primarily rely on overlapping users, whereby knowledge is conveyed between the source and target identities belonging to the same natural person. However, such a heuristic assumption is not universally applicable due to an individual may exhibit distinct or even conflicting preferences in different domains, leading to potential noises. In this paper, we view the anchor links between users of various domains as the learnable parameters to learn the task-relevant cross-domain correlations. A novel optimal transport based model ALCDR is further proposed to precisely infer the anchor links and deeply aggregate collaborative signals from the perspectives of intra-domain and inter-domain. Our proposal is extensively evaluated over real-world datasets, and experimental results demonstrate its superiority.
With the explosive growth of commercial applications of recommender systems, multi-scenario recommendation (MSR) has attracted considerable attention, which utilizes data from multiple domains to improve their recommendation performance simultaneously. However, training a unified deep recommender system (DRS) may not explicitly comprehend the commonality and difference among domains, whereas training an individual model for each domain neglects the global information and incurs high computation costs. Likewise, fine-tuning on each domain is inefficient, and recent advances that apply the prompt tuning technique to improve fine-tuning efficiency rely solely on large-sized transformers. In this work, we propose a novel prompt-enhanced paradigm for multi-scenario recommendation. Specifically, a unified DRS backbone model is first pre-trained using data from all the domains in order to capture the commonality across domains. Then, we conduct prompt tuning with two novel prompt modules, capturing the distinctions among various domains and users. Our experiments on Douban, Amazon, and Ali-CCP datasets demonstrate the effectiveness of the proposed paradigm with two noticeable strengths: (i) its great compatibility with various DRS backbone models, and (ii) its high computation and storage efficiency with only 6% trainable parameters in prompt tuning phase. The implementation code is available for easy reproduction.
SESSION: Session 29 – Multimodal & Multimedia
Multimedia recommendation methods aim to discover the user preference on the multi-modal information to enhance the collaborative filtering (CF) based recommender system. Nevertheless, they seldom consider the impact of feature extraction on the user preference modeling and prediction of the user-item interaction, as the extracted features contain excessive information irrelevant to the recommendation.
To capture the informative features from the extracted ones, we resort to Transformer model to establish the correlation between the items historically interacted by the same user. Considering its challenges in effectiveness and efficiency, we propose a novel Transformer-based recommendation model, termed as Light Graph Transformer model (LightGT). Therein, we develop a modal-specific embedding and a layer-wise position encoder for the effective similarity measurement, and present a light self-attention block to improve the efficiency of self-attention scoring. Based on these designs, we can effectively and efficiently learn the user preference from the off-the-shelf items’ features to predict the user-item interactions. Conducting extensive experiments on Movielens, Tiktok and Kwai datasets, we demonstrate that LigthGT significantly outperforms the state-of-the-art baselines with less time. Our code is publicly available at: https://github.com/Liuwq-bit/LightGT.
Textual response generation is an essential task for multimodal task-oriented dialog systems. Although existing studies have achieved fruitful progress, they still suffer from two critical limitations: 1) focusing on the attribute knowledge but ignoring the relation knowledge that can reveal the correlations between different entities and hence promote the response generation, and 2)only conducting the cross-entropy loss based output-level supervision but lacking the representation-level regularization. To address these limitations, we devise a novel multimodal task-oriented dialog system (named MDS-S2). Specifically, MDS-S2 first simultaneously acquires the context related attribute and relation knowledge from the knowledge base, whereby the non-intuitive relation knowledge is extracted by the n-hop graph walk. Thereafter, considering that the attribute knowledge and relation knowledge can benefit the responding to different levels of questions, we design a multi-level knowledge composition module in MDS-S^2 to obtain the latent composed response representation. Moreover, we devise a set of latent query variables to distill the semantic information from the composed response representation and the ground truth response representation, respectively, and thus conduct the representation-level semantic regularization. Extensive experiments on a public dataset have verified the superiority of our proposed MDS-S2. We have released the codes and parameters to facilitate the research community.
Multimodal representation learning has shown promising improvements on various vision-language tasks (e.g., image-text retrieval, visual question answering, etc) and has significantly advanced the development of multimedia information systems. Most existing methods excel at building global-level alignment between vision and language while lacking effective fine-grained image-text interaction. In this paper, we propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations. Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover. The implicit target provides a unified and debiased objective for vision and language, where the model predicts latent multimodal representations of the unmasked input. The explicit target further enriches the multimodal representations by recovering high-level and semantically meaningful information: momentum visual features of image patches and concepts of word tokens. Through such a masked modeling process, our model not only learns fine-grained multimodal interaction, but also avoids the semantic gap between high-level representations and low-or mid-level prediction targets (e.g., image pixels, discrete vision tokens), thus producing semantically rich multimodal representations that perform well on both zero-shot and fine-tuned settings. Our pre-trained model (named MAMO) achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
Multimedia-based recommendation (MMRec) utilizes multimodal content (images, textual descriptions, etc.) as auxiliary information on historical interactions to determine user preferences. Most MMRec approaches predict user interests by exploiting a large amount of multimodal contents of user-interacted items, ignoring the potential effect of multimodal content of user-uninteracted items. As a matter of fact, there is a small portion of user preference-irrelevant features in the multimodal content of user-interacted items, which may be a kind of spurious correlation with user preferences, thereby degrading the recommendation performance. In this work, we argue that the multimodal content of user-uninteracted items can be further exploited to identify and eliminate the user preference-irrelevant portion inside user-interacted multimodal content, for example by counterfactual inference of causal theory. Going beyond multimodal user preference modeling only using interacted items, we propose a novel model called Multimodal Counterfactual Learning Network (MCLN), in which user-uninteracted items’ multimodal content is additionally exploited to further purify the representation of user preference-relevant multimodal content that better matches the user’s interests, yielding state-of-the-art performance. Extensive experiments are conducted to validate the effectiveness and rationality of MCLN. We release the complete codes of MCLN at https://github.com/hfutmars/MCLN.
Legal case matching, which automatically constructs a model to estimate the similarities between the source and target cases, has played an essential role in intelligent legal systems. Semantic text matching models have been applied to the task where the source and target legal cases are considered as long-form text documents. These general-purpose matching models make the predictions solely based on the texts in the legal cases, overlooking the essential role of the law articles in legal case matching. In the real world, the matching results (e.g., relevance labels) are dramatically affected by the law articles because the contents and the judgments of a legal case are radically formed on the basis of law. From the causal sense, a matching decision is affected by the mediation effect from the cited law articles by the legal cases, and the direct effect of the key circumstances (e.g., detailed fact descriptions) in the legal cases. In light of the observation, this paper proposes a model-agnostic causal learning framework called Law-Match, under which the legal case matching models are learned by respecting the corresponding law articles. Given a pair of legal cases and the related law articles, Law-Match considers the embeddings of the law articles as instrumental variables(IVs), and the embeddings of legal cases as treatments. Using IV regression, the treatments can be decomposed into law-related and law-unrelated parts, respectively reflecting the mediation and direct effects. These two parts are then combined with different weights to collectively support the final matching prediction. We show that the framework is model-agnostic, and a number of legal case matching models can be applied as the underlying models. Comprehensive experiments show that Law-Match can outperform state-of-the-art baselines on three public datasets.
SESSION: Session 30 – Temporal & Dynamic – Part 2
Reasoning on temporal knowledge graphs (TKGR), aiming to infer missing events along the timeline, has been widely studied to alleviate incompleteness issues in TKG, which is composed of a series of KG snapshots at different timestamps. Two types of information, i.e., intra-snapshot structural information and inter-snapshot temporal interactions, mainly contribute to the learned representations for reasoning in previous models. However, these models fail to leverage (1) semantic correlations between relationships for the former information and (2) the periodic temporal patterns along the timeline for the latter one. Thus, such insufficient mining manners hinder expressive ability, leading to sub-optimal performances. To address these limitations, we propose a novel reasoning model, termed RPC, which sufficiently mines the information underlying the Relational correlations and Periodic patterns via two novel Correspondence units, i.e., relational correspondence unit (RCU) and periodic correspondence unit (PCU). Concretely, relational graph convolutional network (RGCN) and RCU are used to encode the intra-snapshot graph structural information for entities and relations, respectively. Besides, the gated recurrent units (GRU) and PCU are designed for sequential and periodic inter-snapshot temporal interactions, separately. Moreover, the model-agnostic time vectors are generated by time2vector encoders to guide the time-dependent decoder for fact scoring. Extensive experiments on six benchmark datasets show that RPC outperforms the state-of-the-art TKGR models, and also demonstrate the effectiveness of two novel strategies in our model.
Most real-world networks evolve over time. Existing literature proposes models for dynamic networks that are either unlabeled or assumed to have a single membership structure. On the other hand, a new family of Mixed Membership Stochastic Block Models (MMSBM) allows to model static labeled networks under the assumption of mixed-membership clustering. In this work, we propose to extend this later class of models to infer dynamic labeled networks under a mixed membership assumption. Our approach takes the form of a temporal prior on the model’s parameters. It relies on the single assumption that dynamics are not abrupt. We show that our method significantly differs from existing approaches, and allows to model more complex systems –dynamic labeled networks. We demonstrate the robustness of our method with several experiments on both synthetic and real-world datasets. A key interest of our approach is that it needs very few training data to yield good results. The performance gain under challenging conditions broadens the variety of possible applications of automated learning tools –as in social sciences, which comprise many fields where small datasets are a major obstacle to the introduction of machine learning methods.
DREAM: Adaptive Reinforcement Learning based on Attention Mechanism for Temporal Knowledge Graph Reasoning
Temporal knowledge graphs (TKGs) model the temporal evolution of events and have recently attracted increasing attention. Since TKGs are intrinsically incomplete, it is necessary to reason out missing elements. Although existing TKG reasoning methods have the ability to predict missing future events, they fail to generate explicit reasoning paths and lack explainability. As reinforcement learning (RL) for multi-hop reasoning on traditional knowledge graphs starts showing superior explainability and performance in recent advances, it has opened up opportunities for exploring RL techniques on TKG reasoning. However, the performance of RL-based TKG reasoning methods is limited due to: (1) lack of ability to capture temporal evolution and semantic dependence jointly; (2) excessive reliance on manually designed rewards. To overcome these challenges, we propose an adaptive reinforcement learning model based on attention mechanism (DREAM) to predict missing elements in the future. Specifically, the model contains two components: (1) a multi-faceted attention representation learning method that captures semantic dependence and temporal evolution jointly; (2) an adaptive RL framework that conducts multi-hop reasoning by adaptively learning the reward functions. Experimental results demonstrate DREAM outperforms state-of-the-art models on public datasets.
Graph neural network (GNN) based algorithms have achieved superior performance in recommendation tasks due to their advanced capability of exploiting high-order connectivity between users and items. However, most existing GNN-based recommendation models ignore the dynamic evolution of nodes, where users will continuously interact with items over time, resulting in rapid changes in the environment (e.g., neighbor and structure). Moreover, the heuristic normalization of embeddings in dynamic recommendation is de-coupled with the model learning process, making the whole system suboptimal. In this paper, we propose a novel framework for generating satisfying recommendations in dynamic environments, called Dynamic Graph Evolution Learning (DGEL). First, we design three efficient real-time update learning methods for nodes from the perspectives of inherent interaction potential, time-decay neighbor augmentation, and symbiotic local structure learning. Second, we construct the re-scaling enhancement networks for dynamic embeddings to adaptively and automatically bridge the normalization process with model learning. Third, we leverage the interaction matching task and the future prediction task together for joint training to further improve performance. Extensive experiments on three real-world datasets demonstrate the effectiveness and improvements of our proposed DGEL. The code is available at https://github.com/henrictang/DGEL.
Reinforcement learning-based recommender systems have recently gained popularity. However, the design of the reward function, on which the agent relies to optimize its recommendation policy, is often not straightforward. Exploring the causality underlying users’ behavior can take the place of the reward function in guiding the agent to capture the dynamic interests of users. Moreover, due to the typical limitations of simulation environments (e.g., data ineffi- ciency), most of the work cannot be broadly applied in large-scale situations. Although some works attempt to convert the offline dataset into a simulator, data inefficiency makes the learning pro- cess even slower. Because of the nature of reinforcement learning (i.e., learning by interaction), it cannot collect enough data to train during a single interaction. Furthermore, traditional reinforcement learning algorithms do not have a solid capability like supervised learning methods to learn from offline datasets directly. In this paper, we propose a new model named the causal decision transformer for recommender systems (CDT4Rec). CDT4Rec is an offline reinforce- ment learning system that can learn from a dataset rather than from online interaction. Moreover, CDT4Rec employs the transformer architecture, which is capable of processing large offline datasets and capturing both short-term and long-term dependencies within the data to estimate the causal relationship between action, state, and reward. To demonstrate the feasibility and superiority of our model, we have conducted experiments on six real-world offline datasets and one online simulator.
SESSION: Session 31 – Implicit Feedback and Sessions
Graphs, as a non-linear data structure, are ubiquitous in practice, and efficient graph analysis can benefit important information retrieval applications in the era of big data. Currently, one of the fundamental graph mining problems is graph embedding, which aims to represent the graph as a low-dimensional feature vector with the content and structural information in the graph preserved. Although the graph embedding technique has evolved considerably, traditional methods mainly focus on node pairwise relationship in graphs, which makes the representational power of such schemes limited. Recently, a number of works have explored the simplicial complexes, which describe the higher-order interactions between nodes in the graphs, and further proposed several Graph Neural Network (GNN) algorithms based on simplicial complexes. However, these GNN approaches are highly inefficient in terms of running time and space, due to massive parameter learning. In this paper, we propose a simple and speedy graph embedding algorithm dubbed SCHash. Through adopting the Locality Sensitive Hashing (LSH) technique, SCHash captures the higher-order information derived from the simplicial complex in the GNN framework, and it can achieve a good balance between accuracy and efficiency. Our extensive experiments clearly show that, in terms of accuracy, the performance of our proposed SCHash algorithm is comparable to that of state-of-the-art GNN algorithms; also, SCHash achieves higher accuracy than the existing LSH algorithms. In terms of efficiency, SCHash runs faster than GNN algorithms by 2 ~ 4 orders of magnitude, and is more efficient than the existing LSH algorithms.
Diversification of recommendation results is a promising approach for coping with the uncertainty associated with users’ information needs. Of particular importance in diversified recommendation is to define and optimize an appropriate diversity objective. In this study, we revisit the most popular diversity objective called intra-list distance (ILD), defined as the average pairwise distance between selected items, and a similar but lesser known objective called dispersion, which is the minimum pairwise distance. Owing to their simplicity and flexibility, ILD and dispersion have been used in a plethora of diversified recommendation research. Nevertheless, we do not actually know what kind of items are preferred by them.
We present a critical reexamination of ILD and dispersion from theoretical and experimental perspectives. Our theoretical results reveal that these objectives have potential drawbacks: ILD may select duplicate items that are very close to each other, whereas dispersion may overlook distant item pairs. As a competitor to ILD and dispersion, we design a diversity objective called Gaussian ILD, which can interpolate between ILD and dispersion by tuning the bandwidth parameter. We verify our theoretical results by experimental results using real-world data and confirm the extreme behavior of ILD and dispersion in practice.
This paper is the first to use contrastive learning to improve the robustness of graph representation learning for signed bipartite graphs, which are commonly found in social networks, recommender systems, and paper review platforms. Existing contrastive learning methods for signed graphs cannot capture implicit relations between nodes of the same type in signed bipartite graphs, which have two types of nodes and edges only connect nodes of different types. We propose a Signed Bipartite Graph Contrastive Learning (SBGCL) method to learn robust node representation while retaining the implicit relations between nodes of the same type. SBGCL augments a signed bipartite graph with a novel two-level graph augmentation method. At the top level, we maintain two perspectives of the signed bipartite graph, one presents the original interactions between nodes of different types, and the other presents the implicit relations between nodes of the same type. At the bottom level, we employ stochastic perturbation strategies to create two perturbed graphs in each perspective. Then, we construct positive and negative samples from the perturbed graphs and design a multi-perspective contrastive loss to unify the node presentations learned from the two perspectives. Results show proposed model is effective over state-of-the-art methods on real-world datasets.
Linear autoencoder models learn an item-to-item weight matrix via convex optimization with L2 regularization and zero-diagonal constraints. Despite their simplicity, they have shown remarkable performance compared to sophisticated non-linear models. This paper aims to theoretically understand the properties of two terms in linear autoencoders. Through the lens of singular value decomposition (SVD) and principal component analysis (PCA), it is revealed that L2 regularization enhances the impact of high-ranked PCs. Meanwhile, zero-diagonal constraints reduce the impact of low-ranked PCs, leading to performance degradation for unpopular items. Inspired by this analysis, we propose simple-yet-effective linear autoencoder models using diagonal inequality constraints, called Relaxed Linear AutoEncoder (RLAE) and Relaxed Denoising Linear AutoEncoder (RDLAE). We prove that they generalize linear autoencoders by adjusting the degree of diagonal constraints. Experimental results demonstrate that our models are comparable or superior to state-of-the-art linear and non-linear models on six benchmark datasets; they significantly improve the accuracy of long-tail items. These results also support our theoretical insights on regularization and diagonal constraints in linear autoencoders.
Uncertainty quantification is one of the most crucial tasks to obtain trustworthy and reliable machine learning models for decision making. However, most research in this domain has only focused on problems with small label spaces and ignored eXtreme Multi-label Classification (XMC), which is an essential task in the era of big data for web-scale machine learning applications. Moreover, enormous label spaces could also lead to noisy retrieval results and intractable computational challenges for uncertainty quantification. In this paper, we aim to investigate general uncertainty quantification approaches for tree-based XMC models with a probabilistic ensemble-based framework. In particular, we analyze label-level and instance-level uncertainty in XMC, and propose a general approximation framework based on beam search to efficiently estimate the uncertainty with a theoretical guarantee under long-tail XMC predictions. Empirical studies on six large-scale real-world datasets show that our framework not only outperforms single models in predictive performance, but also can serve as strong uncertainty-based baselines for label misclassification and out-of-distribution detection, with significant speedup. Besides, our framework can further yield better state-of-the-art results based on deep XMC models with uncertainty quantification.
SESSION: Session 32 – Collaborative Filtering & Graph Neural Approaches
Bundle recommendation aims to recommend a bundle of items to users as a whole with user-bundle (U-B) interaction information, and auxiliary user-item (U-I) interaction and bundle-item affiliation information. Recent methods usually use two graph neural networks (GNNs) to model user’s bundle preferences separately from the U-B graph (bundle view) and U-I graph (item view). However, by conducting statistical analysis, we find that the auxiliary U-I information is far underexplored due to the following reasons: 1) Loosely combining the predicted results cannot well synthesize the knowledge from both views. 2) The local U-B and U-I collaborative relations might not be consistent, leading to GNN’s inaccurate modeling of user’s bundle preference from the U-I graph. 3) The U-I interactions are usually modeled equally while the significant ones corresponding to user’s bundle preference are less emphasized.
Based on these analyses, we propose a Distillation-enhanced Graph Masked AutoEncoder (DGMAE) for bundle recommendation. Our framework extracts the knowledge of first- and higher-order U-B relations from the U-B graph and injects it into a well-designed graph masked autoencoder (student model). The student model is built with two key designs to jointly capture significant local and global U-I relations from the U-I graph. In specific, we design a transformer-enhanced GNN encoder for global relation learning, which increases the model’s representational power of depicting user’s bundle preferences. Meanwhile, an adaptive edge masking strategy and reconstruction target are designed on the significant U-I edges to guide the student model to identify the potential ones suggesting user’s bundle preferences. Extensive experiments on benchmark datasets show the significant improvements of DGMAE over the SOTA methods.
Recently, Graph Neural Networks (GNNs) have become a mainstream recommender system method, where it captures high-order collaborative signals between nodes by performing convolution operations on the user-item interaction graph to predict user preferences for different items. However, in real scenarios, the user-item interaction graph is extremely sparse, which means numerous users only interact with a small number of items, resulting in the inability of GNN in learning high-quality node embeddings. To alleviate this problem, the Graph Contrastive Learning (GCL)-based recommender system method is proposed. GCL improves embedding quality by maximizing the similarity of the positive pair and minimizing the similarity of the negative pair. However, most GCL-based methods use heuristic data augmentation methods, i.e., random node/edge drop and attribute masking, to construct contrastive pairs, resulting in the loss of important information. To solve the problems in GCL-based methods, we propose a novel method, Candidate-aware Graph Contrastive Learning for Recommendation, called CGCL. In CGCL, we explore the relationship between the user and the candidate item in the embedding at different layers and use similar semantic embeddings to construct contrastive pairs. By our proposed CGCL, we construct structural neighbor contrastive learning objects, candidate contrastive learning objects, and candidate structural neighbor contrastive learning objects to obtain high-quality node embeddings. To validate the proposed model, we conducted extensive experiments on three publicly available datasets. Compared with various state-of-the-art DNN-, GNN- and GCL-based methods, our proposed CGCL achieved significant improvements in all indicators.
This paper presents a novel approach to representation learning in recommender systems by integrating generative self-supervised learning with graph transformer architecture. We highlight the importance of high-quality data augmentation with relevant self-supervised pretext tasks for improving performance. Towards this end, we propose a new approach that automates the self-supervision augmentation process through a rationale-aware generative SSL that distills informative user-item interaction patterns. The proposed recommender with Graph Transformer (GFormer) that offers parameterized collaborative rationale discovery for selective augmentation while preserving global-aware user-item relationships. In GFormer, we allow the rationale-aware SSL to inspire graph collaborative filtering with task-adaptive invariant rationalization in graph transformer. The experimental results reveal that our GFormer has the capability to consistently improve the performance over baselines on different datasets. Several in-depth experiments further investigate the invariant rationale-aware augmentation from various aspects. The source code for this work is publicly available at: https://github.com/HKUDS/GFormer.
SESSION: Session 33 – Attacks and Other Recommender Systems
Federated Recommender Systems (FedRecs) are considered privacy-preserving techniques to collaboratively learn a recommendation model without sharing user data. Since all participants can directly influence the systems by uploading gradients, FedRecs are vulnerable to poisoning attacks of malicious clients. However, most existing poisoning attacks on FedRecs are either based on some prior knowledge or with less effectiveness. To reveal the real vulnerability of FedRecs, in this paper, we present a new poisoning attack method to manipulate target items’ ranks and exposure rates effectively in the top-K recommendation without relying on any prior knowledge. Specifically, our attack manipulates target items’ exposure rate by a group of synthetic malicious users who upload poisoned gradients considering target items’ alternative products. We conduct extensive experiments with two widely used FedRecs (Fed-NCF and Fed-LightGCN) on two real-world recommendation datasets. The experimental results show that our attack can significantly improve the exposure rate of unpopular target items with extremely fewer malicious users and fewer global epochs than state-of-the-art attacks. In addition to disclosing the security hole, we design a novel countermeasure for poisoning attacks on FedRecs. Specifically, we propose a hierarchical gradient clipping with sparsified updating to defend against existing poisoning attacks. The empirical results demonstrate that the proposed defending mechanism improves the robustness of FedRecs.
Neural ranking models (NRMs) have attracted considerable attention in information retrieval. Unfortunately, NRMs may inherit the adversarial vulnerabilities of general neural networks, which might be leveraged by black-hat search engine optimization practitioners. Recently, adversarial attacks against NRMs have been explored in the paired attack setting, generating an adversarial perturbation to a target document for a specific query. In this paper, we focus on a more general type of perturbation and introduce the topic-oriented adversarial ranking attack task against NRMs, which aims to find an imperceptible perturbation that can promote a target document in ranking for a group of queries with the same topic. We define both static and dynamic settings for the task and focus on decision-based black-box attacks. We propose a novel framework to improve topic-oriented attack performance based on a surrogate ranking model. The attack problem is formalized as a Markov decision process (MDP) and addressed using reinforcement learning. Specifically, a topic-oriented reward function guides the policy to find a successful adversarial example that can be promoted in rankings to as many queries as possible in a group. Experimental results demonstrate that the proposed framework can significantly outperform existing attack strategies, and we conclude by re-iterating that there exist potential risks for applying NRMs in the real world.
RCENR: A Reinforced and Contrastive Heterogeneous Network Reasoning Model for Explainable News Recommendation
Existing news recommendation methods suffer from sparse and weak interaction data, leading to reduced effectiveness and explainability. Knowledge reasoning, which explores inferential trajectories in the knowledge graph, can alleviate data sparsity and provide explicitly recommended explanations. However, brute-force pre-processing approaches used in conventional methods are not suitable for fast-changing news recommendation. Therefore, we propose an explainable news recommendation model: the Reinforced and Contrastive Heterogeneous Network Reasoning Model for Explainable News Recommendation (RCENR), consisting of NHN-R2 and MR&CO frameworks. The NHN-R2 framework generates user/news subgraphs to enhance recommendation and extend the dimensions and diversity of reasoning. The MR&CO framework incorporates contrastive learning with a reinforcement-based strategy for self-supervised and efficient model training. Experiments on the MIND dataset show that RCENR is able to improve recommendation accuracy and provide diverse and credible explanations.
Recent years have witnessed the rapid development of heterogeneous graph neural networks (HGNNs) in information retrieval (IR) applications. Many existing HGNNs design a variety of tailor-made graph convolutions to capture structural and semantic information in heterogeneous graphs. However, existing HGNNs usually represent each node as a single vector in the multi-layer graph convolution calculation, which makes the high-level graph convolution layer fail to distinguish information from different relations and different orders, resulting in the information loss in the message passing. Then we propose a novel heterogeneous graph neural network with sequential node representation, namely Seq-HGNN. To avoid the information loss caused by the single vector node representation, we first design a sequential node representation learning mechanism to represent each node as a sequence of meta-path representations during the node message passing. Then we propose a heterogeneous representation fusion module, empowering Seq-HGNN to identify important meta-paths and aggregate their representations into a compact one. We conduct extensive experiments on four widely used datasets from Heterogeneous Graph Benchmark (HGB) and Open Graph Benchmark (OGB). Experimental results show that our proposed method outperforms state-of-the-art baselines in both accuracy and efficiency. The source code is available at https://github.com/nobrowning/SEQ_HGNN.
SESSION: Short Research Papers
Query-focused summarization (QFS) aims to provide a summary of a document that satisfies information need of a given query and is useful in various IR applications, such as abstractive snippet generation. Current QFS approaches typically involve injecting additional information, e.g. query-answer relevance or fine-grained token-level interaction between a query and document, into a finetuned large language model. However, these approaches often require extra parameters & training, and generalize poorly to new dataset distributions. To mitigate this, we propose leveraging a recently developed constrained generation model Neurological Decoding (NLD) as an alternative to current QFS regimes which rely on additional sub-architectures and training. We first construct lexical constraints by identifying important tokens from the document using a lightweight gradient attribution model, then subsequently force the generated summary to satisfy these constraints by directly manipulating the final vocabulary likelihood. This lightweight approach requires no additional parameters or finetuning as it utilizes both an off-the-shelf neural retrieval model to construct the constraints and a standard generative language model to produce the QFS. We demonstrate the efficacy of this approach on two public QFS collections achieving near parity with the state-of-the-art model with substantially reduced complexity.
Automatically generating controllable and diverse mathematical word problems (MWPs) which conform to equations and topics is a crucial task in information retrieval and natural language generation. Recent deep learning models mainly focus on improving the problem readability but overlook the mathematical logic coherence, which tends to generate unsolvable problems. In this paper, we draw inspiration from the human problem-designing process and propose a Mathematical structure Planning and Knowledge enhanced Generation model (MaPKG), following the “plan-then-generate” steps. Specifically, we propose a novel dynamic planning module to make sentence-level equation plans and a dual-attention mechanism for word-level generation, incorporating equation structure representation and external commonsense knowledge. Extensive experiments on two MWP datasets show our model can guarantee more solvable, high-quality, and diverse problems. Our code is available at https://github.com/KenelmQLH/MaPKG.git
Unwanted social biases are usually encoded in pretrained language models (PLMs). Recent efforts are devoted to mitigating intrinsic bias encoded in PLMs. However, the separate fine-tuning on applications is detrimental to intrinsic debiasing. A bias resurgence issue arises when fine-tuning the debiased PLMs on downstream tasks. To eliminate undesired stereotyped associations in PLMs during fine-tuning, we present a mixup-based framework Mix-Debias from a new unified perspective, which directly combines debiasing PLMs with fine-tuning applications. The key to Mix-Debias is applying mixup-based linear interpolation on counterfactually augmented downstream datasets, with expanded pairs from external corpora. Besides, we devised an alignment regularizer to ensure original augmented pairs and gender-balanced counterparts are spatially closer. Experimental results show that Mix-Debias can reduce biases in PLMs while maintaining a promising performance in applications.
A Model-Agnostic Popularity Debias Training Framework for Click-Through Rate Prediction in Recommender System
Recommender system (RS) is widely applied in a multitude of scenarios to aid individuals obtaining the information they require efficiently. At the same time, the prevalence of popularity bias in such systems has become a widely acknowledged issue. To address this challenge, we propose a novel method named Model-Agnostic Popularity Debias Training Framework (MDTF). It consists of two basic modules including 1) General Ranking Model (GRM), which is model-agnostic and can be implemented as any ranking models; and 2) Popularity Debias Module (PDM), which estimates the impact of the competitiveness and popularity of candidate items on the CTR, by utilizing the feedback of cold-start users to re-weigh the loss in GRM. MDTF seamlessly integrates these two modules in an end-to-end multi-task learning framework. Extensive experiments on both real-world offline dataset and online A/B test demonstrate its superiority over state-of-the-art methods.
The pre-training and fine-tuning paradigm has become the main-stream framework in the field of Aspect-Based Sentiment Analysis (ABSA). Although it has achieved sound performance in the domains containing enough fine-grained aspect-sentiment annotations, it is still challenging to conduct few-shot ABSA in domains where manual annotations are scarce. In this work, we argue that two kinds of gaps, i.e., domain gap and objective gap, hinder the transfer of knowledge from pre-training language models (PLMs) to ABSA tasks. To address this issue, we introduce a simple yet effective framework called FS-ABSA, which involves domain-adaptive pre-training and text-infilling fine-tuning. We approach the End-to-End ABSA task as a text-infilling problem and perform domain-adaptive pre-training with the text-infilling objective, narrowing the two gaps and consequently facilitating the knowledge transfer. Experiments show that the resulting model achieves more compelling performance than baselines under the few-shot setting while driving the state-of-the-art performance to a new level across datasets under the fully-supervised setting. Moreover, we apply our framework to two non-English low-resource languages to demonstrate its generality and effectiveness.
Sparse neural retrievers, such as DeepImpact, uniCOIL and SPLADE, have been introduced recently as an efficient and effective way to perform retrieval with inverted indexes. They aim to learn term importance and, in some cases, document expansions, to provide a more effective document ranking compared to traditional bag-of-words retrieval models such as BM25. However, these sparse neural retrievers have been shown to increase the computational costs and latency of query processing compared to their classical counterparts. To mitigate this, we apply a well-known family of techniques for boosting the efficiency of query processing over inverted indexes: static pruning. We experiment with three static pruning strategies, namely document-centric, term-centric and agnostic pruning, and we assess, over diverse datasets, that these techniques still work with sparse neural retrievers. In particular, static pruning achieves 2x speedup with negligible effectiveness loss (≤ 2% drop) and, depending on the use case, even 4x speedup with minimal impact on the effectiveness (≤ 8% drop). Moreover, we show that neural rerankers are robust to candidates from statically pruned indexes.
A Unified Formulation for the Frequency Distribution of Word Frequencies using the Inverse Zipf’s Law
The power-law approximation for the frequency distribution of words postulated by Zipf has been extensively studied for decades, which led to many variations on the theme. However, comparatively less attention has been paid to the investigation of the case of word frequencies. In this paper, we derive its analytical expression from the inverse of the underlying rank-size distribution as a function of total word count, vocabulary size and the shape parameter, thereby providing a unified framework to explain the nonlinear behavior of low frequencies on the log-log scale. We also present an efficient method based on relative entropy minimization for a robust estimation of the shape parameter using a small number of empirical low-frequency probabilities. Experiments were carried out for a selected set of languages with varying degrees of inflection in order to demonstrate the effectiveness of the proposed approach.
Learned sparse retrieval (LSR) is a family of neural retrieval methods that transform queries and documents into sparse weight vectors aligned with a vocabulary. While LSR approaches like Splade work well for short passages, it is unclear how well they handle longer documents. We investigate existing aggregation approaches for adapting LSR to longer documents and find that proximal scoring is crucial for LSR to handle long documents. To leverage this property, we proposed two adaptations of the Sequential Dependence Model (SDM) to LSR: ExactSDM and SoftSDM. ExactSDM assumes only exact query term dependence, while SoftSDM uses potential functions that model the dependence of query terms and their expansion terms (i.e., terms identified using a transformer’s masked language modeling head).
Experiments on the MSMARCO Document and TREC Robust04 datasets demonstrate that both ExactSDM and SoftSDM outperform existing LSR aggregation approaches for different document length constraints. Surprisingly, SoftSDM does not provide any performance benefits over ExactSDM. This suggests that soft proximity matching is not necessary for modeling term dependence in LSR. Overall, this study provides insights into handling long documents with LSR, proposing adaptations that improve its performance.
Large-scale commercial platforms usually involve numerous business scenarios for diverse business strategies. To provide click-through rate (CTR) predictions for multiple scenarios simultaneously, existing promising multi-scenario models explicitly construct scenario-specific networks by manually grouping scenarios based on particular business strategies. Nonetheless, this pre-defined data partitioning process heavily relies on prior knowledge, and it may neglect the underlying data distribution of each scenario, hence limiting the model’s representation capability. Regarding the above issues, we propose Adaptive Distribution Learning (ADL): an end-to-end optimization distribution framework which is composed of a clustering process and classification process. Specifically, we design a distribution adaptation module with a customized dynamic routing mechanism. Instead of introducing prior knowledge for pre-defined data allocation, this routing algorithm adaptively provides a distribution coefficient for each sample to determine which cluster it belongs to. Each cluster corresponds to a particular distribution so that the model can sufficiently capture the commonalities and distinctions between these distinct clusters. Our results on both public and large-scale industrial datasets show the effectiveness and efficiency of ADL: the model yields impressive prediction accuracy with more than 50% reduction in time cost during the training phase when compared to other methods.
Intent detection plays an essential role in dialogue systems. This paper takes the lead to study open compound domain adaptation (OCDA) for intent detection, which brings the advantage of improved generalization to unseen domains. OCDA for intent detection is indeed a more realistic domain adaptation setting, which learns an intent classifier from labeled source domains and adapts it to unlabeled compound target domains containing different intent classes with the source domains. At inference time, we test the intent classifier in open domains that contain previously unseen intent classes. To this end, we propose an Adversarial Meta Prompt Tuning method (called AMPT) for open compound domain adaptive intent detection. Concretely, we propose a meta prompt tuning method, which utilizes language prompts to elicit rich knowledge from large-scale pre-trained language models (PLMs) and automatically finds better prompt initialization that facilitates fast adaptation via meta learning. Furthermore, we leverage a domain adversarial training technique to acquire domain-invariant representations of diverse domains. By taking advantage of the collaborative effect of meta learning, prompt tuning, and adversarial training, we can learn an intent classifier that can effectively generalize to unseen open domains. Experimental results on two benchmark datasets (i.e., HWU64 and CLINC) show that our model can learn substantially better-generalized representations for unseen domains compared with strong competitors.
Information retrieval (IR) relies on a general notion of relevance, which is used as the principal foundation for ranking and evaluation methods. However, IR does not account for more a nuanced affective experience. Here, we consider the emotional response decoded directly from the human brain as an alternative dimension of relevance. We report an experiment covering seven different scenarios in which we measure and predict how users emotionally respond to visual image contents by using functional near-infrared spectroscopy (fNIRS) neuroimaging on two commonly used affective dimensions: valence (negativity and positivity) and arousal (boredness and excitedness). Our results show that affective states can be successfully decoded using fNIRS, and utilized to complement the present notion of relevance in IR studies. For example, we achieved 0.39 Balanced accuracy and 0.61 AUC in 4-class classification of affective states (vs. 0.25 Balanced accuracy and 0.5 AUC of a random classifier). Likewise, we achieved 0.684 Precision@20 when retrieving high-arousal images. Our work opens new avenues for incorporating emotional states in IR evaluation, affective feedback, and information filtering.
Recommender systems often suffer from popularity bias, where popular items are overly recommended while sacrificing unpopular items. Existing researches generally focus on ensuring the number of recommendations (exposure) of each item is equal or proportional, using inverse propensity weighting, causal intervention, or adversarial training. However, increasing the exposure of unpopular items may not bring more clicks or interactions, resulting in skewed benefits and failing in achieving real reasonable popularity debiasing. In this paper, we propose a new criterion for popularity debiasing, i.e., in an unbiased recommender system, both popular and unpopular items should receive Interactions Proportional to the number of users who Like it, namely IPL criterion. Under the guidance of the criterion, we then propose a debiasing framework with IPL regularization term which is theoretically shown to achieve a win-win situation of both popularity debiasing and recommendation performance. Experiments conducted on four public datasets demonstrate that when equipping two representative collaborative filtering models with our framework, the popularity bias is effectively alleviated while maintaining the recommendation performance.
CTR prediction is crucial in recommendation systems and online advertising platforms, where user-generated data streams that drift over time can lead to catastrophic forgetting if the model continuously adapts to new data distribution. Conventional strategies for catastrophic forgetting are challenging to deploy due to memory constraints and diverse data distributions. To address this, we propose a novel drift-aware incremental learning framework based on ensemble learning for CTR prediction, which uses explicit error-based drift detection on streaming data to strengthen well-adapted ensembles and freeze ensembles that do not match the input distribution, avoiding catastrophic interference. Our method outperforms all baselines considered in offline experiments and A/B tests.
Recently, a series of pioneer studies have shown the potency of pre-trained models in sequential recommendation, illuminating the path of building an omniscient unified pre-trained recommendation model for different downstream recommendation tasks. Despite these advancements, the vulnerabilities of classical recommender systems also exist in pre-trained recommendation in a new form, while the security of pre-trained recommendation model is still unexplored, which may threaten its widely practical applications. In this study, we propose a novel framework for backdoor attacking in pre-trained recommendation. We demonstrate the provider of the pre-trained model can easily insert a backdoor in pre-training, thereby increasing the exposure rates of target items to target user groups. Specifically, we design two novel and effective backdoor attacks: basic replacement and prompt-enhanced, under various recommendation pre-training usage scenarios. Experimental results on real-world datasets show that our proposed attack strategies significantly improve the exposure rates of target items to target users by hundreds of times in comparison to the clean model. The source codes are released in https://github.com/wyqing20/APRec.
The main idea of multimodal recommendation is the rational utilization of the item’s multimodal information to improve the recommendation performance. Previous works directly integrate item multimodal features with item ID embeddings, ignoring the inherent semantic relations contained in the multimodal features. In this paper, we propose a novel and effective aTtention-guided Multi-step FUsion Network for multimodal recommendation, named TMFUN. Specifically, our model first constructs modality feature graph and item feature graph to model the latent item-item semantic structures. Then, we use the attention module to identify inherent connections between user-item interaction data and multimodal data, evaluate the impact of multimodal data on different interactions, and achieve early-step fusion of item features. Furthermore, our model optimizes item representation through the attention-guided multi-step fusion strategy and contrastive learning to improve recommendation performance. The extensive experiments on three real-world datasets show that our model has superior performance compared to the state-of-the-art models.
Transformers emerged as powerful methods for sequential recommendation. However, existing architectures often overlook the complex dependencies between user preferences and the temporal context. In this short paper, we introduce MOJITO, an improved Transformer sequential recommender system that addresses this limitation. MOJITO leverages Gaussian mixtures of attention-based temporal context and item embedding representations for sequential modeling. Such an approach permits to accurately predict which items should be recommended next to users depending on past actions and the temporal context. We demonstrate the relevance of our approach, by empirically outperforming existing Transformers for sequential recommendation on several real-world datasets.
Effective cross-lingual dense retrieval methods that rely on multilingual pre-trained language models (PLMs) need to be trained to encompass both the relevance matching task and the cross-language alignment task. However, cross-lingual data for training is often scarcely available. In this paper, rather than using more cross-lingual data for training, we propose to use cross-lingual query generation to augment passage representations with queries in languages other than the original passage language. These augmented representations are used at inference time so that the representation can encode more information across the different target languages. Training of a cross-lingual query generator does not require additional training data to that used for the dense retriever. The query generator training is also effective because the pre-training task for the generator (T5 text-to-text training) is very similar to the fine-tuning task (generation of a query). The use of the generator does not increase query latency at inference and can be combined with any cross-lingual dense retrieval method. Results from experiments on a benchmark cross-lingual information retrieval dataset show that our approach can improve the effectiveness of existing cross-lingual dense retrieval methods. Implementation of our methods, along with all generated query files are made publicly available at https://github.com/ielab/xQG4xDR.
Deep recommender systems typically involve numerous feature fields for users and items, with a large number of low-frequency features. These low-frequency features would reduce the prediction accuracy with large storage space due to their vast quantity and inadequate training. Some pioneering studies have explored embedding compression techniques to address this issue of the trade-off between storage space and model predictability. However, these methods have difficulty compacting the embedding of low-frequency features in various feature fields due to the high demand for human experience and computing resources during hyper-parameter searching. In this paper, we propose the AutoDPQ framework, which automatically compacts low-frequency feature embeddings for each feature field to an adaptive magnitude. Experimental results indicate that AutoDPQ can significantly reduce the parameter space while improving recommendation accuracy. Moreover, AutoDPQ is compatible with various deep CTR models by improving their performance significantly with high efficiency.
Conversational recommender systems (CRS) enhance the expressivity and personalization of recommendations through multiple turns of user-system interaction. Critiquing is a well-known paradigm for CRS that allows users to iteratively refine recommendations by providing feedback about attributes of recommended items. While existing critiquing methodologies utilize direct attributes of items to address user requests such as ‘I prefer Western movies’, the opportunity of incorporating richer contextual and side information about items stored in Knowledge Graphs (KG) into the critiquing paradigm has been overlooked. Employing this substantial knowledge together with a well-established reasoning methodology paves the way for critique-based recommenders to allow for complex knowledge-based feedback (e.g., ‘I like movies featuring war side effects on veterans’) which may arise in natural user-system conversations. In this work, we aim to increase the flexibility of critique-based recommendation by integrating KGs and propose a novel Bayesian inference framework that enables reasoning with relational knowledge-based feedback. We study and formulate the framework considering a Gaussian likelihood and evaluate it on two well-known recommendation datasets with KGs. Our evaluations demonstrate the effectiveness of our framework in leveraging indirect KG-based feedback (i.e., preferred relational properties of items rather than preferred items themselves), often improving personalized recommendations over a one-shot recommender by more than 15%. This work enables a new paradigm for using rich knowledge content and reasoning over indirect evidence as a mechanism for critiquing interactions with CRS.
With the increasing popularity of location-based services, the point-of-interest (POI) search has received considerable attention in recent years. Existing studies on POI search mostly focus on how to construct better retrieval models to retrieve the relevant POI based on query-POI matching. However, user behavior in POI search, i.e., how users examine the search engine result page (SERP), is mostly underexplored. A good understanding of user behavior is well-recognized as a key to develop effective user models and retrieval models to improve the search quality. Therefore, in this paper, we propose to investigate user behavior in POI search with a lab study in which users’ eye movements and their implicit feedback on the SERP are collected. Based on the collected data, we analyze (1) query-level user behavior patterns in POI search, i.e., examination and interactions on SERP; (2) session-level user behavior patterns in POI search, i.e., query reformulation, termination of search, etc. Our work sheds light on user behavior in POI search and could potentially benefit future studies on related research topics.
Middle training methods aim to bridge the gap between the Masked Language Model (MLM) pre-training and the final finetuning for retrieval. Recent models such as CoCondenser, RetroMAE, and LexMAE argue that the MLM task is not sufficient enough to pre-train a transformer network for retrieval and hence propose various tasks to do so. Intrigued by those novel methods, we noticed that all these models used different finetuning protocols, making it hard to assess the benefits of middle training. We propose in this paper a benchmark of CoCondenser, RetroMAE, and LexMAE, under the same finetuning conditions. We compare both dense and sparse approaches under various finetuning protocols and middle training on different collections (MS MARCO, Wikipedia). We use additional middle training baselines, such as a standard MLM finetuning on the retrieval collection, optionally augmented by a CLS predicting the passage term frequency. For the sparse approach, our study reveals that there is almost no statistical difference between those methods: the more effective the finetuning procedure is, the less difference there is between those models. For the dense approach, RetroMAE using MS MARCO as middle-training collection shows excellent results in almost all the settings. Finally, we show that middle training on the retrieval collection, thus adapting the language model to it, is a critical factor. Overall, a better experimental setup should be adopted to evaluate middle training methods.
Biomedical Named Entity Recognition (BioNER) is the fundamental task of identifying named entities from biomedical text. However, BioNER suffers from severe data scarcity and lacks high-quality labeled data due to the highly specialized and expert knowledge required for annotation. Though data augmentation has shown to be highly effective for low-resource NER in general, existing data augmentation techniques fail to produce factual and diverse augmentations for BioNER. In this paper, we present BioAug, a novel data augmentation framework for low-resource BioNER. BioAug, built on BART, is trained to solve a novel text reconstruction task based on selective masking and knowledge augmentation. Post training, we perform conditional generation and generate diverse augmentations conditioning BioAug on selectively corrupted text similar to the training stage. We demonstrate the effectiveness of BioAug on 5 benchmark BioNER datasets and show that BioAug outperforms all our baselines by a significant margin (1.5%-21.5% absolute improvement) and is able to generate augmentations that are both more factual and diverse. Code: https://github.com/Sreyan88/BioAug.
Prediction models for click-through rate (CTR) learn feature interactions underlying user behaviors, which are crucial in recommendation systems. Due to their size and complexity, existing approaches have a limited range of applications. In order to decrease inference delay, knowledge distillation techniques have been used in recommendation systems. Due to the student model’s lower capacity, the knowledge distillation process is less effective when there is a significant difference in the complexity of the network architecture between the teacher model and the student model.
We present a novel knowledge distillation approach called Bridge-based Knowledge Distillation (BKD), which employs a bridge model to facilitate the student model’s learning from the teacher model’s latent representations. The bridge model is based on Graph Neural Networks (GNNs), and leverages the edges of GNNs to identify significant feature interaction relationships, while simultaneously reducing redundancy for improved efficiency. To further enhance the efficiency of knowledge distillation, we decoupled the extracted knowledge and transferred each component separately to the student model, aiming to improve the distillation sufficiency of each module. Extensive experimental results show that our proposed BKD approach outperforms state-of-the-art competitors on various tasks.
In the field of E-commerce, the rapid introduction of new products poses challenges for product description generation. Traditional approaches rely on large labelled datasets, which are often unavailable for novel products with limited data. To address this issue, we propose a calibration learning approach for few-shot novel product description. Our method leverages a small amount of labelled data for calibration and utilizes the novel product’s semantic representation as prompts to generate accurate and informative descriptions. We evaluate our approach on three large-scale e-commerce datasets of novel products and demonstrate its effectiveness in significantly improving the quality of generated product descriptions compared to existing methods, especially when only limited data is available. We also conduct the analysis to understand the impact of different modules on the performance.
This paper explores the utility of a Large Language Model (LLM) to automatically generate queries and query variants from a description of an information need. Given a set of information needs described as backstories, we explore how similar the queries generated by the LLM are to those generated by humans. We quantify the similarity using different metrics and examine how the use of each set would contribute to document pooling when building test collections. Our results show potential in using LLMs to generate query variants. While they may not fully capture the wide variety of human-generated variants, they generate similar sets of relevant documents, reaching up to 71.1% overlap at a pool depth of 100.
Recommendation models are typically trained on observational user interaction data, but the interactions between latent factors in users’ decision-making processes lead to complex and entangled data. Disentangling these latent factors to uncover their underlying representation can improve the robustness, interpretability, and controllability of recommendation models. This paper introduces the Causal Disentangled Variational Auto-Encoder (CaD-VAE), a novel approach for learning causal disentangled representations from interaction data in recommender systems. The CaD-VAE method considers the causal relationships between semantically related factors in real-world recommendation scenarios, rather than enforcing independence as in existing disentanglement methods. The approach utilizes structural causal models to generate causal representations that describe the causal relationship between latent factors. The results demonstrate that CaD-VAE outperforms existing methods, offering a promising solution for disentangling complex user behavior data in recommendation systems.
Most e-commerce platforms consist of multiple entries (e.g., recommendation, search, shopping cart and etc.) for users to purchase their liked items. Among the research on the recommendation entry, most of them focus on improving the conversion volumes merely in the recommendation entry. However, such way could not ensure an increase in the global conversion volumes of the e-commerce platform. To achieve this goal by optimizing the recommendation entry only, in this paper, we focus on modeling the causality between the recommendation-entry-impression and the conversion by proposing the two-stage Causality Enhanced Conversion (CEC) model. In the first stage, we define the recommendation-entry-impression as treatment, then we estimate the conversion rate conditioned on the inclusion or exclusion of treatment respectively and calculate the corresponding individual treatment effect (ITE). In the second stage, we propose a propensity-normalization (PN) based method to transform the learned ITE to a weight term for instance weighting in the conversion loss. Extensive offline and online experiments on a large-scale food e-commerce scenario demonstrate that the CEC model could focus more on those conversed instances that can improve the global conversion volumes of the platform.
Position bias, the phenomenon whereby users tend to focus on higher-ranked items of the search result list regardless of the actual relevance to queries, is prevailing in many ranking systems. Position bias in training data biases the ranking model, leading to increasingly unfair item rankings, click-through-rate (CTR), and conversion rate (CVR) predictions. To jointly mitigate position bias in both item CTR and CVR prediction, we propose two position-bias-free CTR and CVR prediction models: Position-Aware Click-Conversion (PACC) and PACC via Position Embedding (PACC-PE). PACC is built upon probability decomposition and models position information as a probability. PACC-PE utilizes neural networks to model product-specific position information as embedding. Experiments on the E-commerce sponsored product search dataset show that our proposed models have better ranking effectiveness and can greatly alleviate position bias in both CTR and CVR prediction.
Popularity bias in recommendation lists refers to over-representation of popular content and is a challenge for many recommendation algorithms. Previous research has suggested several offline metrics to quantify popularity bias, which commonly relate the popularity of items in users’ recommendation lists to the popularity of items in their interaction history. Discrepancies between these two factors are referred to as popularity miscalibration. While popularity metrics provide a straightforward and well-defined means to measure popularity bias, it is unknown whether they actually reflect users’ perception of popularity bias.
To address this research gap, we conduct a crowd-sourced user study on Prolific, involving 56 participants, to (1) investigate whether the level of perceived popularity miscalibration differs between common recommendation algorithms, (2) assess the correlation between perceived popularity miscalibration and its corresponding quantification according to a common offline metric. We conduct our study in a well-defined and important domain, namely music recommendation using the standardized LFM-2b dataset, and quantify popularity miscalibration of five recommendation algorithms by utilizing Jensen-Shannon distance (JSD). Challenging the findings of previous studies, we observe that users generally do perceive significant differences in terms of popularity bias between algorithms if this bias is framed as popularity miscalibration. In addition, JSD correlates moderately with users’ perception of popularity, but not with their perception of unpopularity.
As web applications continue to expand and diversify their services, user interactions exist in different scenarios. To leverage this wealth of information, cross-domain recommendation (CDR) has gained significant attention in recent years. However, existing CDR approaches mostly focus on information transfer between observed domains, with little attention paid to generalizing to unseen domains. Although recent research on invariant learning can help for the purpose of generalization, relying only on invariant preference may be overly conservative and result in mediocre performance when the unseen domain shifts slightly. In this paper, we present a novel framework that considers both CDR and domain generalization through a united causal invariant view. We assume that user interactions are determined by domain-invariant preference and domain-specific preference. The proposed approach differentiates the invariant preference and the specific preference from observational behaviors in a way of adversarial learning. Additionally, a novel domain routing module is designed to connect unseen domains to observed domains. Extensive experiments on public and industry datasets have proved the effectiveness of the proposed approach under both CDR and domain generalization settings.
Query reformulation is a key mechanism to alleviate the linguistic chasm of query in ad-hoc retrieval. Among various solutions, query reduction effectively removes extraneous terms and specifies concise user intent from long queries. However, it is challenging to capture hidden and diverse user intent. This paper proposes Contextualized Query Reduction (ConQueR) using a pre-trained language model (PLM). Specifically, it reduces verbose queries with two different views: core term extraction and sub-query selection. One extracts core terms from an original query at the term level, and the other determines whether a sub-query is a suitable reduction for the original query at the sequence level. Since they operate at different levels of granularity and complement each other, they are finally aggregated in an ensemble manner. We evaluate the reduction quality of ConQueR on real-world search logs collected from a commercial web search engine. It achieves up to 8.45% gains in exact match scores over the best competing model.
Click-through rate (CTR) prediction plays a crucial role in industrial recommendation and advertising systems, which generate and expose multiple items for each user request. Although the user’s click action on an item will be affected by the other exposed items (called contextual items), current CTR prediction methods do not exploit this context because CTR prediction is performed before the contextual items are generated. This paper introduces a solution Contextual Items Simulation and Modeling (CISM) to tackle this limitation. Specifically, we propose a near-line Context Simulation Center to simulate exposure page without affecting online service latency, and an online Context Modeling Transformer to learn user-wise context from the simulated results w.r.t. the candidate item. In addition, knowledge distillation is introduced to further improve CTR prediction. Extensive experiments on both public and industrial datasets demonstrate the effectiveness of CISM. Currently, CISM has been deployed in the online display advertising system of Meituan Waimai, serving the main traffic.
Conversion rate (CVR) prediction plays an important role in advertising systems. Recently, supervised deep neural network-based models have shown promising performance in CVR prediction. However, they are data hungry and require an enormous amount of training data. In online advertising systems, although there are millions to billions of ads, users tend to click only a small set of them and to convert on an even smaller set. This data sparsity issue restricts the power of these deep models. In this paper, we propose the Contrastive Learning for CVR prediction (CL4CVR) framework. It associates the supervised CVR prediction task with a contrastive learning task, which can learn better data representations exploiting abundant unlabeled data and improve the CVR prediction performance. To tailor the contrastive learning task to the CVR prediction problem, we propose embedding masking (EM), rather than feature masking, to create two views of augmented samples. We also propose a false negative elimination (FNE) component to eliminate samples with the same feature as the anchor sample, to account for the natural property in user behavior data. We further propose a supervised positive inclusion (SPI) component to include additional positive samples for each anchor sample, in order to make full use of sparse but precious user conversion events. Experimental results on two real-world conversion datasets demonstrate the superior performance of CL4CVR. The source code is available at https://github.com/DongRuiHust/CL4CVR.
Multi-task learning for various real-world applications usually involves tasks with logical sequential dependence. For example, in online marketing, the cascade behavior pattern of impression \rightarrow click \rightarrow conversion is usually modeled as multiple tasks in a multi-task manner, where the sequential dependence between tasks is simply connected with an explicitly defined function or implicitly transferred information in current works. These methods alleviate the data sparsity problem for long-path sequential tasks as the positive feedback becomes sparser along with the task sequence. However, the error accumulation and negative transfer will be a severe problem for downstream tasks. Especially, at the beginning stage of training, the optimization for parameters of former tasks is not converged yet, and thus the information transferred to downstream tasks is negative. In this paper, we propose a prior information merged model (PIMM), which explicitly models the logical dependence among tasks with a novel prior information merged (PIM) module for multiple sequential dependence task learning in a curriculum manner. Specifically, the PIM randomly selects the true label information or the prior task prediction with a soft sampling strategy to transfer to the downstream task during the training. Following an easy-to-difficult curriculum paradigm, we dynamically adjust the sampling probability to ensure that the downstream task will get the effective information along with the training. The offline experimental results on both public and product datasets verify that PIMM outperforms state-of-the-art baselines. Moreover, we deploy the PIMM in a large-scale FinTech platform, and the online experiments also demonstrate the effectiveness of PIMM.
Incremental Named Entity Recognition (INER) aims to continually train a model with new data, recognizing emerging entity types without forgetting previously learned ones. Prior INER methods have shown that Logits Distillation (LD), which involves preserving predicted logits via knowledge distillation, effectively alleviates this challenging issue. In this paper, we discover that a predicted logit can be decomposed into two terms that measure the likelihood of an input token belonging to a specific entity type or not. However, the traditional LD only preserves the sum of these two terms without considering the change in each component. To explicitly constrain each term, we propose a novel Decomposing Logits Distillation (DLD) method, enhancing the model’s ability to retain old knowledge and mitigate catastrophic forgetting. Moreover, DLD is model-agnostic and easy to implement. Extensive experiments show that DLD consistently improves the performance of state-of-the-art INER methods across ten INER settings in three datasets.
While the integration of product images enhances the recommendation performance of visual-based recommender systems (VRSs), this can make the model vulnerable to adversaries that can produce noised images capable to alter the recommendation behavior. Recently, stronger and stronger adversarial attacks have emerged to raise awareness of these risks; however, effective defense methods are still an urgent open challenge. In this work, we propose “Adversarial Image Denoiser” (AiD), a novel defense method that cleans up the item images by malicious perturbations. In particular, we design a training strategy whose denoising objective is to minimize both the visual differences between clean and adversarial images and preserve the ranking performance in authentic settings. We perform experiments to evaluate the efficacy of AiD using three state-of-the-art adversarial attacks mounted against standard VRSs. Code and datasets at https://github.com/sisinflab/Denoise-to-protect-VRS.
Recently, Graph neural networks (GNNs) have been adopted to model a wide range of structured data from academic and industry fields. With the rapid development of Internet technology, there are more and more meaningful applications for Internet devices, including device identification, geolocation and others, whose performance needs improvement. To replicate the several claimed successes of GNNs, this paper proposes DeviceGPT based on a generative pre-training transformer on a heterogeneous graph via self-supervised learning to learn interactions-rich information of devices from its large-scale databases well. The experiments on the dataset constructed from the real world show DeviceGPT could achieve competitive results in multiple Internet applications.
Neural knowledge models emerged and advanced common-sense-centric knowledge grounding. They parameterize a small seed curated commonsense knowledge graph (CS-KG) in a language model to generalize more. A current trend is to scale the seed up by directly mixing multiple sources of CS-KG (e.g., ATOMIC, ConceptNet) into one model. But, such brute-force mixing inevitably hinders effective knowledge consolidation due to i) ambiguous, polysemic, and/or inconsistent relations across sources and ii) knowledge learned in an entangled manner despite distinct types (e.g., causal, temporal). To mitigate this, we adopt a concept of commonsense knowledge dimension and propose a brand-new dimension-disentangled knowledge model (D2KM) learning paradigm with multiple sources. That is, a generative language model with dimension-specific soft prompts is trained to disentangle knowledge acquisitions along with different dimensions and facilitate potential intra-dimension consolidation across CS-KG sources. Experiments show our knowledge model outperforms its baselines in both standard and zero-shot scenarios.
Conversation disentanglement aims to identify and group utterances from a conversation into separate threads. Existing methods primarily focus on disentangling multi-party conversations with three or more speakers, explicitly or implicitly incorporating speaker-related feature signals to disentangle. Most existing models require a large amount of human annotated data for model training, and often focus on pairwise relations between utterances, not accounting much for the conversational context. In this work, we propose a multi-task learning approach with a contrastive learning objective, DiSC, to disentangle conversations between two speakers — a user and a virtual speech assistant, for a novel domain of e-commerce. We analyze multiple ways and granularities to define conversation “threads”. DiSC jointly learns the relation between pairs of utterances, as well as between utterances and their respective thread context. We train and evaluate our models on multiple multi-threaded conversation datasets that were automatically created, without any human labeling effort. Experimental results on public datasets as well as real-world shopping conversations from a commercial speech assistant show that DiSC outperforms state-of-the-art baselines by at least 3%, across both automatic and human evaluation metrics. We also demonstrate how DiSC improves downstream dialog response generation in the shopping domain.
Advances in Visually Rich Document Understanding (VrDU) have enabled information extraction and question answering over documents with complex layouts. Two tropes of architectures have emerged-transformer-based models inspired by LLMs, and Graph Neural Networks. In this paper, we introduce DocGraphLM, a novel framework that combines pre-trained language models with graph semantics. To achieve this, we propose 1) a joint encoder architecture to represent documents, and 2) a novel link prediction approach to reconstruct document graphs. DocGraphLM predicts both directions and distances between nodes using a convergent joint loss function that prioritizes neighborhood restoration and downweighs distant node detection. Our experiments on three SotA datasets show consistent improvement on IE and QA tasks with the adoption of graph features. Moreover, we report that adopting the graph features accelerates convergence in the learning process druing training, despite being solely constructed through link prediction.
Federated learning (FL) is a popular way of edge computing that does not compromise user’s privacy. Current FL paradigms assume data only resides on the edge, while cloud servers only perform model averaging. However, in real-life situations such as recommender systems, the cloud server usually has abundant features and computation resources. Specifically, the cloud stores historical and interactive features, and the edge stores privacy-sensitive and real-time features. In this paper, our proposed Edge-Cloud Collaborative Knowledge Transfer Framework (ECCT) jointly utilizes the edge-side features and the cloud-side features, enabling bi-directional knowledge transfer between the two by sharing feature embeddings and prediction logits. ECCT consolidates various benefits, including enhancing personalization, enabling model heterogeneity, tolerating training asynchronization, and relieving communication burdens. Extensive experiments on public and industrial datasets demonstrate the effectiveness of ECCT.
This paper introduces Sparsified Late Interaction for Multi-vector (SLIM) retrieval with inverted indexes. Multi-vector retrieval methods have demonstrated their effectiveness on various retrieval datasets, and among them, ColBERT is the most established method based on the late interaction of contextualized token embeddings of pre-trained language models. However, efficient ColBERT implementations require complex engineering and cannot take advantage of off-the-shelf search libraries, impeding their practical use. To address this issue, SLIM first maps each contextualized token vector to a sparse, high-dimensional lexical space before performing late interaction between these sparse token embeddings. We then introduce an efficient two-stage retrieval architecture that includes inverted index retrieval followed by a score refinement module to approximate the sparsified late interaction, which is fully compatible with off-the-shelf lexical search libraries such as Lucene. SLIM achieves competitive accuracy on MS MARCO Passages and BEIR compared to ColBERT while being much smaller and faster on CPUs. To our knowledge, we are the first to explore using sparse token representations for multi-vector retrieval. Source code and data are integrated into the Pyserini IR toolkit.
Generative models have taken the world by storm — image generative models such as Stable Diffusion and DALL-E generate photo-realistic images, whereas image captioning models such as BLIP, GIT, ClipCap, and ViT-GPT2 generate descriptive and informative captions. While it may be true that these models produce remarkable results, their systematic evaluation is missing, making it hard to advance the research further. Currently, heuristic metrics such as the Inception Score and the Fréchet Inception Distance are the most prevalent metrics for the image generation task, while BLEU, CIDEr, SPICE, METEOR, BERTScore, and CLIPScore are common for the image captioning task. Unfortunately, these are poorly interpretable and are not based on the solid user-behavior model that the Information Retrieval community has worked towards. In this paper, we present a novel cross-modal retrieval framework to evaluate the effectiveness of cross-modal (image-to-text and text-to-image) generative models using reference text and images. We propose the use of scoring models based on user-behavior, such as Normalized Discounted Cumulative Gain (nDCG’@K ) and Rank-Biased Precision (RBP’@K) adjusted for incomplete judgments. Experiments using ECCV Caption and Flickr8k-EXPERTS benchmark datasets demonstrate the effectiveness of various image captioning and image generation models for the proposed retrieval task. Results also indicate that the nDCG’@K and RBP’@K scores are consistent with heuristics-driven metrics, excluding CLIPScore, in model selection.
In the classical e-commerce platforms, the personalized product-tying recommendation has proven to be of great added value, which improves users’ purchase willingness to product-tying by displaying the suitable marketing creative. In this paper, we present a new recommendation problem, i.e., the Pop-up One-time Marketing (POM), where the product-tying marketing creative only pops up one time when the user pays for the main item. POM has become a ubiquitous application in e-commerce platforms, e.g., buy the mobile tying mobile case and buy flight ticket tying insurance. However, many existing recommendation methods are sub-optimal for the creative marketing in the POM scenario due to unconsidering the unique characteristics in the scenario. To tackle this problem, we propose a novel framework named Event-aware Adaptive Clustering Uplift Network (EACU-Net) for the POM scenario, which is to our best knowledge the first attempt along this line. EACU-Net contains three modules: (1) the event-aware graph cascading learning, which employs a heterogeneous graph network to comprehensively learn the embedding for the user attributes, event categories, and creative elements by stage. (2) an adaptive clustering uplift network, which learns the sensitivity of users to creatives under the same context. (3) an event-aware information gain network to learn more information from samples with event affection. Extensive offline and online evaluations on a real-world e-commerce platform demonstrate the superior performance of the proposed model compared with the state-of-the-art method.
Examining the Impact of Uncontrolled Variables on Physiological Signals in User Studies for Information Processing Activities
Physiological signals can potentially be applied as objective measures to understand the behavior and engagement of users interacting with information access systems. However, the signals are highly sensitive, and many controls are required in laboratory user studies. To investigate the extent to which controlled or uncontrolled (i.e., confounding) variables such as task sequence or duration influence the observed signals, we conducted a pilot study where each participant completed four types of information-processing activities (READ, LISTEN, SPEAK, and WRITE). Meanwhile, we collected data on blood volume pulse, electrodermal activity, and pupil responses. We then used machine learning approaches as a mechanism to examine the influence of controlled and uncontrolled variables that commonly arise in user studies. Task duration was found to have a substantial effect on the model performance, suggesting it represents individual differences rather than giving insight into the target variables. This work contributes to our understanding of such variables in using physiological signals in information retrieval user studies.
Neural retrieval models (NRMs) have been shown to outperform their statistical counterparts owing to their ability to capture semantic meaning via dense document representations. These models, however, suffer from poor interpretability as they do not rely on explicit term matching. As a form of local per-query explanations, we introduce the notion of equivalent queries that are generated by maximizing the similarity between the NRM’s results and the result set of a sparse retrieval system with the equivalent query. We then compare this approach with existing methods such as RM3-based query expansion and contrast differences in retrieval effectiveness and in the terms generated by each approach.
Semantic place retrieval aims to find the top-k place entities, which are both textually relevant and spatially close to a given query, from a knowledge graph. In this work, our contribution toward improving the efficiency of semantic place retrieval is two-fold. First, we show that by applying an ad hoc yet intuitive restriction on the depth of search on the knowledge graph, it is possible to adopt IR-tree indexing scheme , which has been introduced for processing spatial keyword queries, for the semantic place retrieval scenario. Secondly, as a novel solution to this problem, we adapt the idea of cluster-skipping inverted index (CS-IIS) [1, 4], which has been originally proposed for retrieval over topically clustered document collections. Our experiments show that CS-IIS is comparable to IR-tree in terms of CPU time, while it yields substantial efficiency gains in terms of I/O time during query processing.
Recent years have witnessed the transition from sentence-level to document-level in relation extraction (RE), with new formulation, new methods and new insights. Yet, the fundamental concept, mention, is not well-considered and well-defined. Current datasets usually use automatically-detected named entities as mentions, which leads to the missing reference problem. We show that such phenomenon hinders models’ reasoning abilities. To address it, we propose to incorporate coreferences (e.g. pronouns and common nouns) into mentions, based on which we refine and re-annotate the widely-used DocRED benchmark as R-DocRED. We evaluate various methods and conduct thorough experiments to demonstrate the efficacy of our formula. Specifically, the results indicate that incorporating coreferences helps reduce the long-term dependencies, further improving models’ robustness and generalization under adversarial and low-resource settings. The new dataset is made publicly available for future research.
Bandit algorithms for online learning to rank (OLTR) problems often aim to maximize long-term revenue by utilizing user feedback. From a practical point of view, however, such algorithms have a high risk of hurting user experience due to their aggressive exploration. Thus, there has been a rising demand for safe exploration in recent years. One approach to safe exploration is to gradually enhance the quality of an original ranking that is already guaranteed acceptable quality. In this paper, we propose a safe OLTR algorithm that efficiently exchanges one of the items in the current ranking with an item outside the ranking (i.e., an unranked item) to perform exploration. We select an unranked item optimistically to explore based on Kullback-Leibler upper confidence bounds (KL-UCB) and safely re-rank the items including the selected one. Through experiments, we demonstrate that the proposed algorithm improves long-term regret from baselines without any safety violation.
Summarization of textual content has many applications, ranging from summarizing long documents to recent efforts towards summarizing user generated text (e.g., tweets, Facebook or Reddit posts). Traditionally, the focus of summarization has been to generate summaries which can best satisfy the readers. In this work, we look at summarization of user-generated content as a two-sided problem where satisfaction of both readers and authors is crucial. Through three surveys, we show that for user-generated content, traditional evaluation approach of measuring similarity between reference summaries and algorithmic summaries cannot capture author satisfaction. We propose an author satisfaction-based evaluation metric CROSSEM which, we show empirically, can potentially complement the current evaluation paradigm. We further propose the idea of inequality in satisfaction, to account for individual fairness amongst readers and authors. To our knowledge, this is the first attempt towards developing a fair summary evaluation framework for user generated content, and is likely to spawn lot of future research in this space.
Widely used dynamic pruning algorithms (such as MaxScore, WAND and BMW) keep track of the k-th highest score (i.e., heap threshold) among the documents that are scored so far, to avoid scoring the documents that cannot get into the top-k result list. Obviously, the faster the heap threshold converges to its final value, the larger will be the number of skipped documents and hence, the efficiency gains of the pruning algorithms. In this paper, we tailor approaches that reorder the documents in the inverted index based on their access counts and ranks for previous queries. By storing such frequently retrieved documents at front of the postings lists, we aim to compute the heap threshold earlier during the query processing. Our approach yields substantial speedups (up to 1.33x) for all three dynamic pruning algorithms and outperforms two strong baselines that have been employed for document reordering in the literature.
Multi-layer perceptron (MLP) serves as a core component in many deep models for click-through rate (CTR) prediction. However, vanilla MLP networks are inefficient in learning multiplicative feature interactions, making feature interaction learning an essential topic for CTR prediction. Existing feature interaction networks are effective in complementing the learning of MLPs, but they often fall short of the performance of MLPs when applied alone. Thus, their integration with MLP networks is necessary to achieve improved performance. This situation motivates us to explore a better alternative to the MLP backbone that could potentially replace MLPs. Inspired by factorization machines, in this paper, we propose FINAL, a factorized interaction layer that extends the widely-used linear layer and is capable of learning 2nd-order feature interactions. Similar to MLPs, multiple FINAL layers can be stacked into a FINAL block, yielding feature interactions with an exponential degree growth. We unify feature interactions and MLPs into a single FINAL block and empirically show its effectiveness as a replacement for the MLP block. Furthermore, we explore the ensemble of two FINAL blocks as an enhanced two-stream CTR model, setting a new state-of-the-art on open benchmark datasets. FINAL can be easily adopted as a building block and has achieved business metric gains in multiple applications at Huawei. Our source code will be made available at MindSpore/models and FuxiCTR/model_zoo.
Modern search and recommendation systems are optimized using logged interaction data. There is increasing societal pressure to enable users of such systems to have some of their data deleted from those systems. This paper focuses on “unlearning” such user data from neighborhood-based recommendation models on sparse, high-dimensional datasets. We present caboose, a custom top-k index for such models, which enables fast and exact deletion of user interactions. We experimentally find that caboose provides competitive index building times, makes sub-second unlearning possible (even for a large index built from one million users and 256 million interactions), and, when integrated into three state-of-the-art next-basket recommendation models, allows users to effectively adjust their predictions to remove sensitive items.
Friend recall is an important way to improve Daily Active Users (DAU) in online games. The problem is to generate a proper inactive (lost) friend ranking list essentially. Traditional friend recall methods focus on rules like friend intimacy or training a classifier for predicting lost players’ return probability, but ignore feature information of (active) players and historical friend recall events. In this work, we treat friend recall as a link prediction problem and explore several link prediction methods which can use features of both active and lost players, as well as historical events. Furthermore, we propose a novel Edge Transformer model and pre-train the model via masked auto-encoders. Our method achieves state-of-the-art results in the offline experiments and online A/B Tests of three Tencent games.
Continual graph learning (CGL) aims to mitigate the topological-feature-induced catastrophic forgetting problem (TCF) in graph neural networks, which plays an essential role in the field of information retrieval. The TCF is mainly caused by the forgetting of node features of old tasks and the forgetting of topological features shared by old and new tasks. Existing CGL methods do not pay enough attention to the forgetting of topological features shared between different tasks. In this paper, we propose a transformer-based CGL method (Trans-CGL), thereby taking full advantage of the transformer’s properties to mitigate the TCF problem. Specifically, to alleviate forgetting of node features, we introduce a gated attention mechanism for Trans-CGL based on parameter isolation that allows the model to be independent of each other when learning old and new tasks. Furthermore, to address the forgetting of shared parameters that store topological information between different tasks, we propose an asymmetric mask attention regularization module to constrain the shared attention parameters ensuring that the shared topological information is preserved. Comparative experiments show that the method achieves competitive performance on four real-world datasets.
Current query expansion models use pseudo-relevance feedback to improve first-pass retrieval effectiveness; however, this fails when the initial results are not relevant. Instead of building a language model from retrieved results, we propose Generative Relevance Feedback (GRF) that builds probabilistic feedback models from long-form text generated from Large Language Models. We study the effective methods for generating text by varying the zero-shot generation subtasks: queries, entities, facts, news articles, documents, and essays. We evaluate GRF on document retrieval benchmarks covering a diverse set of queries and document collections, and the results show that GRF methods significantly outperform previous PRF methods. Specifically, we improve MAP between 5-19% and NDCG@10 17-24% compared to RM3 expansion, and achieve state-of-the-art recall across all datasets.
Multi-task learning (MTL) has been widely applied in online advertising systems. To address the negative transfer issue, recent optimization methods emphasized the gradient alignment of directions or magnitudes. Since prior studies have proven that the shared modules contain both general and specific knowledge, overemphasizing on gradient alignment may crowd out task-specific knowledge. In this paper, we propose a transference-driven approach CoGrad that adaptively maximizes knowledge transference via Coordinated Gradient modification. We explicitly quantify the transference as loss reduction from one task to another, and optimize it to derive an auxiliary gradient. By incorporating this gradient into original task gradients, the model automatically maximizes inter-task transfer and minimizes individual losses, leading to general and specific knowledge harmonization. Besides, we introduce an efficient approximation of the Hessian matrix, making CoGrad computationally efficient. Both offline and online experiments verify that CoGrad significantly outperforms previous methods.
Graph collaborative filtering (GCF) is a popular technique for capturing high-order collaborative signals in recommendation systems. However, GCF’s bipartite adjacency matrix, which defines the neighbors being aggregated based on user-item interactions, can be noisy for users/items with abundant interactions and insufficient for users/items with scarce interactions. Additionally, the adjacency matrix ignores user-user and item-item correlations, which can limit the scope of beneficial neighbors being aggregated.
In this work, we propose a new graph adjacency matrix that incorporates user-user and item-item correlations, as well as a properly designed user-item interaction matrix that balances the number of interactions across all users. To achieve this, we pre-train a graph-based recommendation method to obtain users/items embeddings, and then enhance the user-item interaction matrix via top-K sampling. We also augment the symmetric user-user and item-item correlation components to the adjacency matrix. Our experiments demonstrate that the enhanced user-item interaction matrix with improved neighbors and lower density leads to significant benefits in graph-based recommendation. Moreover, we show that the inclusion of user-user and item-item correlations can improve recommendations for users with both abundant and insufficient interactions.
Negative sampling plays a crucial role in training successful sequential recommendation models. Instead of merely employing random negative sample selection, numerous strategies have been proposed to mine informative negative samples to enhance training and performance. However, few of these approaches utilize structural information. In this work, we observe that as training progresses, the distributions of node-pair similarities in different groups with varying degrees of neighborhood overlap change significantly, suggesting that item pairs in distinct groups may possess different negative relationships. Motivated by this observation, we propose a graph-based negative sampling approach based on neighborhood overlap (GNNO) to exploit structural information hidden in user behaviors for negative mining. GNNO first constructs a global weighted item transition graph using training sequences. Subsequently, it mines hard negative samples based on the degree of overlap with the target item on the graph. Furthermore, GNNO employs curriculum learning to control the hardness of negative samples, progressing from easy to difficult. Extensive experiments on three Amazon benchmarks demonstrate GNNO’s effectiveness in consistently enhancing the performance of various state-of-the-art models and surpassing existing negative sampling strategies. The code will be released at https://github.com/floatSDSDS/GNNO.
Knowledge graph embedding aims at modeling knowledge by projecting entities and relations into a low-dimensional semantic space. Most of the works on knowledge graph embedding construct negative samples by negative sampling as knowledge graphs typically only contain positive facts. Although substantial progress has been made by dynamic distribution based sampling methods, selecting plausible and prior information-engaged negative samples still poses many challenges. Inspired by type constraint methods, we propose Hierarchical Type Enhanced Negative Sampling (HTENS) which leverages hierarchical entity type information and entity-relation cooccurrence information to optimize the sampling probability distribution of negative samples. The experiments performed on the link prediction task demonstrate the effectiveness of HTENS. Additionally, HTENS shows its superiority in versatility and can be integrated into scalable systems with enhanced negative sampling.
Medical decision-making processes can be enhanced by comprehensive biomedical knowledge bases, which require fusing knowledge graphs constructed from different sources via a uniform index system. The index system often organizes biomedical terms in a hierarchy to provide the aligned entities with fine-grained granularity. To address the challenge of scarce supervision in the biomedical knowledge fusion (BKF) task, researchers have proposed various unsupervised methods. However, these methods heavily rely on ad-hoc lexical and structural matching algorithms, which fail to capture the rich semantics conveyed by biomedical entities and terms. Recently, neural embedding models have proved effective in semantic-rich tasks, but they rely on sufficient labeled data to be adequately trained. To bridge the gap between the scarce-labeled BKF and neural embedding models, we propose HiPrompt, a supervision-efficient knowledge fusion framework that elicits the few-shot reasoning ability of large language models through hierarchy-oriented prompts. Empirical results on the collected KG-Hi-BKF benchmark datasets demonstrate the effectiveness of HiPrompt.
Existing community detection methods for attributed multiplex networks focus on exploiting the complementary information from different topologies, while they are paying little attention to the role of attributes. However, we observe that real attributed multiplex networks exhibit two unique features, namely, consistency and homogeneity of node attributes. Therefore, in this paper, we propose a novel method, called ACDM, which is based on these two characteristics of attributes, to detect communities on attributed multiplex networks. Specifically, we extract commonality representation of nodes through the consistency of attributes. The collaboration between the homogeneity of attributes and topology information reveals the particularity representation of nodes. The comprehensive experimental results on real attributed multiplex networks well validate that our method outperforms state-of-the-art methods in most networks.
Learning expressive representations for high-dimensional yet sparse features has been a longstanding problem in information retrieval. Though recent deep learning methods can partially solve the problem, they often fail to handle the numerous sparse features, particularly those tail feature values with infrequent occurrences in the training data. Worse still, existing methods cannot explicitly leverage the correlations among different instances to help further improve the representation learning on sparse features since such relational prior knowledge is not provided. To address these challenges, in this paper, we tackle the problem of representation learning on feature-sparse data from a graph learning perspective. Specifically, we propose to model the sparse features of different instances using hypergraphs where each node represents a data instance and each hyperedge denotes a distinct feature value. By passing messages on the constructed hypergraphs based on our Hypergraph Transformer (HyperFormer), the learned feature representations capture not only the correlations among different instances but also the correlations among features. Our experiments demonstrate that the proposed approach can effectively improve feature representation learning on sparse features.
Advancements in text-guided diffusion models have allowed for the creation of visually appealing images similar to those created by professional artists. The effectiveness of these models depends on the composition of the textual description, known as the prompt, and its accompanying keywords. Evaluating aesthetics computationally is difficult, so human input is necessary to determine the ideal prompt formulation and keyword combination. In this study, we propose a human-in-the-loop method for discovering the most effective combination of prompt keywords using a genetic algorithm. Our approach demonstrates how this can lead to an improvement in the visual appeal of images generated from the same description.
Recent work has identified that distillation can be used to create vector quantization based ANN indexes by learning the inverted file index and product quantization. The argued advantage of using a fixed teacher encoder for queries and documents is that the scores produced by the teacher can be used instead of the label judgements that are required when using traditional supervised learning, such as contrastive learning. However, current work only distills the teacher encoder outputs of dot products between quantized query embedddings and product quantized document embeddings. Our work combines the benefits of contrastive learning and distillation by using contrastive distillation whereby the teacher outputs contrastive scores that the student learns from. Our experimental results on MSMARCO passage retrieval and NQ open question answering datasets show that contrastive distillation improves over current state of the art for vector quantized dense retrieval.
This paper presents ConvRerank, a conversational passage re-ranker that employs a newly developed pseudo-labeling approach. Our proposed view-ensemble method enhances the quality of pseudo-labeled data, thus improving the fine-tuning of ConvRerank. Our experimental evaluation on benchmark datasets shows that combining ConvRerank with a conversational dense retriever in a cascaded manner achieves a good balance between effectiveness and efficiency. Compared to baseline methods, our cascaded pipeline demonstrates lower latency and higher top-ranking effectiveness. Furthermore, the in-depth analysis confirms the potential of our approach to improving the effectiveness of conversational search.
Recent years have witnessed the boom of deep neural networks in online news recommendation service. As news articles mainly consist of textual content, pre-trained language models~(PLMs) (e.g. BERT) have been widely adopted as the backbone to encode them into news embeddings, which would be utilized to generate the user representations or perform the semantic matching. However, existing PLMs are mostly pre-trained on large-scale general corpus, and have not been specially adapted for capturing the rich information within news articles. Therefore, their produced news embeddings may be not informative enough to represent the news content or characterize the relations among news. To solve it, we propose a bottlenecked multi-task pre-training approach, which relies on an information-bottleneck encoder-decoder architecture to compress the useful semantic information into the news embedding. Concretely, we design three pre-training tasks, to enforce the news embedding to recover the news contents of itself, its frequently oc-occurring neighbours, and the news with similar topics. We conduct experiments on the MIND dataset and show that our approach can outperform competitive pre-training methods.
A number of information retrieval studies have been done to assess which statistical techniques are appropriate for comparing systems. However, these studies are focused on TREC-style experiments, which typically have fewer than 100 topics. There is no similar line of work for large search and recommendation experiments; such studies typically have thousands of topics or users and much sparser relevance judgements, so it is not clear if recommendations for analyzing traditional TREC experiments apply to these settings. In this paper, we empirically study the behavior of significance tests with large search and recommendation evaluation data. Our results show that the Wilcoxon and Sign tests show significantly higher Type-1 error rates for large sample sizes than the bootstrap, randomization and t-tests, which were more consistent with the expected error rate. While the statistical tests displayed differences in their power for smaller sample sizes, they showed no difference in their power for large sample sizes. We recommend the sign and Wilcoxon tests should not be used to analyze large scale evaluation results. Our result demonstrate that with Top-N recommendation and large search evaluation data, most tests would have a 100% chance of finding statistically significant results. Therefore, the effect size should be used to determine practical or scientific significance.
Queries with similar information needs tend to have similar document clicks, especially in biomedical literature search engines where queries are generally short and top documents account for most of the total clicks. Motivated by this, we present a novel architecture for biomedical literature search, namely Log-Augmented DEnse Retrieval (LADER), which is a simple plug-in module that augments a dense retriever with the click logs retrieved from similar training queries. Specifically, LADER finds both similar documents and queries to the given query by a dense retriever. Then, LADER scores relevant (clicked) documents of similar queries weighted by their similarity to the input query. The final document scores by LADER are the average of (1) the document similarity scores from the dense retriever and (2) the aggregated document scores from the click logs of similar queries. Despite its simplicity, LADER achieves new state-of-the-art (SOTA) performance on TripClick, a recently released benchmark for biomedical literature retrieval. On the frequent (HEAD) queries, LADER largely outperforms the best retrieval model by 39% relative NDCG@10 (0.338 v.s. 0.243). LADER also achieves better performance on the less frequent (TORSO) queries with 11% relative NDCG@10 improvement over the previous SOTA (0.303 v.s. 0.272). On the rare (TAIL) queries where similar queries are scarce, LADER still compares favorably to the previous SOTA method (NDCG@10: 0.310 v.s. 0.295). On all queries, LADER can improve the performance of a dense retriever by 24%-37% relative NDCG@10 while not requiring additional training, and further performance improvement is expected from more logs. Our regression analysis has shown that queries that are more frequent, have higher entropy of query similarity and lower entropy of document similarity, tend to benefit more from log augmentation.
Data collection and mining is a crucial bottleneck for cross-lingual information retrieval (CLIR). While previous works used machine translation and iterative training, we present a novel approach to cross-lingual pretraining called LAPCA (language-agnostic pretraining with cross-lingual alignment). We train the LAPCA-LM model based on XLM-RoBERTa and łexa that significantly improves cross-lingual knowledge transfer for question answering and sentence retrieval on, e.g., XOR-TyDi and Mr. TyDi datasets, and in the zero-shot cross-lingual scenario performs on par with supervised methods, outperforming many of them on MKQA.
Crowdsourcing provides a practical approach for obtaining annotated data to train supervised learning models. However, since the crowd annotators may have different expertise domain and cannot always guarantee the high-quality annotations, learning from crowds generally suffers from the problem of unreliable results of introducing some noises, which makes it hard to achieve satisfying performance. In this work, we investigate the reliability of annotations to improve learning from crowds. Specifically, we first project annotator and data instance to factor vectors and model the complex interaction between annotator expertise and instance difficulty to predict annotation reliability. The learned reliability can be used to evaluate the quality of crowdsourced data directly. Then, we construct a new annotation, namely soft annotation, which serves as the gold label during the training. To recognize the different strengths of annotators, we model each annotator’s confusion in an end-to-end manner. Extensive experimental results on three real-world datasets demonstrate the effectiveness of our method.
Mixup is an efficient data augmentation technique, which improves generalization by interpolating random examples. While numerous approaches have been developed for Mixup in the Euclidean and in the hyperbolic space, they do not fully use the intrinsic properties of the examples, i.e., they manually set the geometry (Euclidean or hyperbolic) based on the overall dataset, which may be sub-optimal since each example may require a different geometry. We propose DynaMix, a framework that automatically selects an example-specific geometry and performs Mixup between the different geometries to improve training dynamics and generalization. Through extensive experiments in image and text modalities we show that DynaMix outperforms state-of-the-art methods over six downstream applications. We find that DynaMix is more useful in low-resource and semi-supervised settings likely because it displays a probabilistic view of the geometry.
Asking clarifying questions has become a key element of various conversational systems, allowing for an effective resolution of ambiguity and uncertainty through natural language questions. Despite the extensive applications of spatial information grounded dialogues, it remains an understudied area on learning to ask clarification questions with the capability of spatial reasoning. In this work, we propose a novel method, named SpatialCQ, for this problem. Specifically, we first align the representation space between textual and spatial information by encoding spatial states with textual descriptions. Then a multi-relational graph is constructed to capture the spatial relations and enable spatial reasoning with relational graph attention networks. Finally, a unified encoder is adopted to fuse the multimodal information for asking clarification questions. Experimental results on the latest IGLU dataset show the superiority of the proposed method over existing approaches.
We present a method for performing zero-shot Dialogue State Tracking (DST) by casting the task as a learning-to-ask-questions framework. The framework learns to pair the best question generation (QG) strategy with in-domain question answering (QA) methods to extract slot values from a dialogue without any human intervention. A novel self-supervised QA pretraining step using in-domain data is essential to learn the structure without requiring any slot-filling annotations. Moreover, we show that QG methods need to be aligned with the same grammatical person used in the dialogue. Empirical evaluation on the MultiWOZ 2.1 dataset demonstrates that our approach, when used alongside robust QA models, outperforms existing zero-shot methods in the challenging task of zero-shot cross domain adaptation-given a comparable amount of domain knowledge during data creation. Finally, we analyze the impact of the types of questions used, and demonstrate that the algorithmic approach outperforms template-based question generation.
Many recent QA models retrieve answers from passages, rather than whole documents, due to the limitations of deep learning models with limited context size. However, this approach ignores important document-level cues that can be crucial in answering questions. This paper reviews three open-domain QA benchmarks from a document-level perspective and finds that they are biased towards passage-level information. Out of 17,000 assessed questions, 82 were identified as requiring document-level reasoning and could not be answered by passage-based models. Document-level retrieval (BM25) outperformed both dense and sparse passage-level retrieval on these questions, highlighting the need for more evaluation of models’ ability to understand documents, an often-overlooked challenge in open-domain QA.
Users may demand recommendations with highly personalized requirements involving logical operations, e.g., the intersection of two requirements, where such requirements naturally form structured logical queries on knowledge graphs (KGs). To date, existing recommender systems lack the capability to tackle users’ complex logical requirements. In this work, we formulate the problem of recommendation with users’ logical requirements (LogicRec) and construct benchmark datasets for LogicRec. Furthermore, we propose an initial solution for LogicRec based on logical requirement retrieval and user preference retrieval, where we face two challenges. First, KGs are incomplete in nature. Therefore, there are always missing true facts, which entails that the answers to logical requirements can not be completely found in KGs. In this case, item selection based on the answers to logical queries is not applicable. We thus resort to logical query embedding (LQE) to jointly infer missing facts and retrieve items based on logical requirements. Second, answer sets are under-exploited. Existing LQE methods can only deal with query-answer pairs, where queries in our case are the intersected user preferences and logical requirements. However, the logical requirements and user preferences have different answer sets, offering us richer knowledge about the requirements and preferences by providing requirement-item and preference-item pairs. Thus, we design a multi-task knowledge-sharing mechanism to exploit these answer sets collectively. Extensive experimental results demonstrate the significance of the LogicRec task and the effectiveness of our proposed method.
Time-series forecasting has been actively studied and adopted in various real-world domains. Recently there have been two research mainstreams in this area: building Transformer-based architectures such as Informer, Autoformer and Reformer, and developing time-series representation learning frameworks based on contrastive learning such as TS2Vec and CoST. Both efforts have greatly improved the performance of time series forecasting. In this paper, we investigate a novel direction towards improving the forecasting performance even more, which is orthogonal to the aforementioned mainstreams as a model-agnostic scheme. We focus on time stamp embeddings that has been less-focused in the literature. Our idea is simple-yet-effective: based on given current time stamp, we predict embeddings of its near future time stamp and utilize the predicted embeddings in the time-series (value) forecasting task. We believe that if such future time information can be previewed at the time of prediction, they can be utilized by any time-series forecasting models as useful additional information. Our experimental results confirmed that our method consistently and significantly improves the accuracy of the recent Transformer-based models and time-series representation learning frameworks. Our code is available at: https://github.com/sunsunmin/Look_Ahead
Organic recommendation and advertising recommendation usually coexist on e-commerce platforms. In this paper, we study the problem of utilizing data from organic recommendation to reinforce click-through rate prediction in advertising scenarios from a multi-view learning perspective. We propose a novel method, termed LOVF (Layered Organic View Fusion). LOVF implements a multi-view fusion mechanism – for each advertising instance, LOVF derives deep representations layer-by-layer from the organic recommendation view and these deep representations are then fused into the corresponding vanilla representations of the advertising view. Extensive experiments across a variety of backbones demonstrate LOVF’s generality, effectiveness and efficiency on a new real-world production dataset. The dataset encompasses data from both the organic recommendation and advertising scenarios. Notably, LOVF has been successfully deployed in the advertising recommender system of JD.com, which is one of the world’s largest e-commerce platforms; online A/B testing shows that LOVF achieves impressive improvement on advertising clicks and revenue. Our code and dataset are available at https://github.com/adsturing/lovf for facilitating further research.
Machine reading comprehension (MRC) is an essential task for many question-answering applications. However, existing MRC datasets mainly focus on data with single answer and overlook multiple answers, which are common in the real world. In this paper, we aim to construct an MRC dataset with both data of single answer and multiple answers. To achieve this purpose, we design a novel pipeline method: data collection, data cleaning, question generation and test set annotation. Based on these procedures, we construct a high-quality multi-answer MRC dataset (MA-MRC) with 129K question-answer-context samples. We implement a sequence of baselines and carry out extensive experiments on MA-MRC. According to the experimental results, MA-MRC is a challenging dataset, which can facilitate the future research on the multi-answer MRC task.
The past few years have witnessed an explosive growth of user-generated POI-centric travel blogs, which can provide a comprehensive understanding of a POI for people. However, evaluating the quality of the POI-centric travel blogs and ranking the blogs is not a simple task without domain knowledge or actual travel experience on the target POI. Nevertheless, our insight is that the user search behavior related to the target POI on the online map service can partly valid the rationality of the POIs appearing in the travel blogs, which helps for travel blogs ranking. To this end, in this paper, we propose a novel end-to-end framework for travel blogs ranking, coined Matching POI and Travel Blogs with Multi-view InFormation (MOTIF). Concretely, we first construct two POI graphs as multi-view information: (1) the search-level POI graph which reflects the user behaviors on the online map service; and (2) the document-level POI graph which shows the POI co-occurrence frequency in travel blogs. Then, to better model the intrinsic correlation of the two graphs, we adopt Mutual Information Maximization to align the search-level and document-level semantic spaces. Moreover, we leverage a pair-wise ranking loss for POI-document relevance scoring. Extensive experiments on two real-world datasets demonstrate the superiority of our method.
MaxSimE: Explaining Transformer-based Semantic Similarity via Contextualized Best Matching Token Pairs
Current semantic search approaches rely on black-box language models, such as BERT, which limit their interpretability and transparency. In this work, we propose MaxSimE, an explanation method for language models applied to measure semantic similarity. Our approach is inspired by the explainable-by-design ColBERT architecture and generates explanations by matching contextualized query tokens to the most similar tokens from the retrieved document according to the cosine similarity of their embeddings. Unlike existing post-hoc explanation methods, which may lack fidelity to the model and thus fail to provide trustworthy explanations in critical settings, we demonstrate that MaxSimE can generate faithful explanations under certain conditions and how it improves the interpretability of semantic search results on ranked documents from the LoTTe benchmark, showing its potential for trustworthy information retrieval.
Nowadays, the mainstream approach in position allocation system is to utilize a reinforcement learning model to allocate appropriate locations for items in various channels and then mix them into the feed. There are two types of data employed to train reinforcement learning (RL) model for position allocation, named strategy data and random data. Strategy data is collected from the current online model, it suffers from an imbalanced distribution of stateaction pairs, resulting in severe overestimation problems during training. On the other hand, random data offers a more uniform distribution of state-action pairs, but is challenging to obtain in industrial scenarios as it could negatively impact platform revenue and user experience due to random exploration. As the two types of data have different distributions, designing an effective strategy to leverage both types of data to enhance the efficacy of the RL model training has become a highly challenging problem. In this study, we propose a framework namedMulti-Distribution Data Learning (MDDL) to address the challenge of effectively utilizing both strategy and random data for training RL models on mixed multi-distribution data. Specifically, MDDL incorporates a novel imitation learning signal to mitigate overestimation problems in strategy data and maximizes the RL signal for random data to facilitate effective learning. In our experiments, we evaluated the proposed MDDL framework in a real-world position allocation system and demonstrated its superior performance compared to the previous baseline. MDDL has been fully deployed on the Meituan food delivery platform and currently serves over 300 million users.
Medical dialogue systems (MDS) have shown promising abilities to diagnose through a conversation with a patient like a human doctor would. However, current systems are mostly based on sequence modeling, which does not account for medical knowledge. This makes the systems more prone to misdiagnosis in case of diseases with limited information. To overcome this issue, we present MDKG, an end-to-end dialogue system for medical dialogue generation (MDG) specifically designed to adapt to new diseases by quickly learning and evolving a meta-knowledge graph that allows it to reason about disease-symptom correlations. Our approach relies on a medical knowledge graph to extract disease-symptom relationships and uses a dynamic graph-based meta-learning framework to learn how to evolve the given knowledge graph to reason about disease-symptom correlations. Our approach incorporates medical knowledge and hence reduces the need for a large number of dialogues. Evaluations show that our system outperforms existing approaches when tested on benchmark datasets.
In order to determine the relevance of a given item to a query, most modern search ranking systems make use of features which aggregate prior user behavior for that item and query (e.g. click rate). For practical reasons, when running A/B tests on ranking systems, these features are generally shared between all treatments. For the most common experiment designs, which randomize traffic by user or by session, this creates a pathway by which the behavior of units in one treatment can effect the outcomes for units in other treatments, violating the Stable Unit Treatment Value Assumption (SUTVA) and biasing measured outcomes. Moreover, for experiments targeting improvements to the behavior data available to such features (e.g. online exploration), this pathway is precisely the one we are trying to affect; if such changes occur identically in treatment and control, then they cannot be measured. To address this, we propose the use of experiments which instead randomize traffic based on the search query. To validate our approach, we perform a pair of A/B tests on an explore-exploit framework in the Amazon search page: one under query randomization, and one under user randomization. In line with the theoretical predictions, we find that the query-randomized iteration is able to measure a statistically significant effect (+0.66% Purchases, p=0.001) where the user-randomized iteration does not (-0.02% Purchases, p=0.851).
Session-based Recommendation (SR) aims to predict users’ next click based on their behavior within a short period, which is crucial for online platforms. However, most existing SR methods somewhat ignore the fact that user preference is not necessarily strongly related to the order of interactions. Moreover, they ignore the differences in importance between different samples, which limits the model-fitting performance. To tackle these issues, we put forward the method, Mining Interest Trends and Adaptively Assigning Sample Weight, abbreviated as MTAW. Specifically, we model users’ instant interest based on their present behavior and all their previous behaviors. Meanwhile, we discriminatively integrate instant interests to capture the changing trend of user interest to make more personalized recommendations. Furthermore, we devise a novel loss function that dynamically weights the samples according to their prediction difficulty in the current epoch. Extensive experimental results on two benchmark datasets demonstrate the effectiveness and superiority of our method.
Model-free RL-based recommender systems have recently received increasing research attention due to their capability to handle partial feedback and long-term rewards. However, most existing research has ignored a critical feature in recommender systems: one user’s feedback on the same item at different times is random. The stochastic rewards property essentially differs from that in classic RL scenarios with deterministic rewards, which makes RL-based recommender systems much more challenging. In this paper, we first demonstrate in a simulator environment where using direct stochastic feedback results in a significant drop in performance. Then to handle the stochastic feedback more efficiently, we design two stochastic reward stabilization frameworks that replace the direct stochastic feedback with that learned by a supervised model. Both frameworks are model-agnostic, i.e., they can effectively utilize various supervised models. We demonstrate the superiority of the proposed frameworks over different RL-based recommendation baselines with extensive experiments on a recommendation simulator as well as an industrial-level recommender system.
Modeling Orders of User Behaviors via Differentiable Sorting: A Multi-task Framework to Predicting User Post-click Conversion
User post-click conversion prediction is of high interest to researchers and developers. Recent studies employ multi-task learning to tackle the selection bias and data sparsity problem, two severe challenges in post-click behavior prediction, by incorporating click data. However, prior works mainly foucsed on pointwise learning and the orders of labels (i.e., click and post-click) are not well explored, which naturally poses a listwise learning problem. Inspired by recent advances on differentiable sorting, in this paper, we propose a novel multi-task framework that leverages orders of user behaviors to predict user post-click conversion in an end-to-end approach. Specifically, we define an aggregation operator to combine predicted outputs of different tasks to a unified score, then we use the computed scores to model the label relations via differentiable sorting. Extensive experiments on public and industrial datasets show the superiority of our proposed model against competitive baselines.
Relevance models measure the semantic closeness between queries and the candidate ads, widely recognized as the nucleus of sponsored search systems. Conventional relevance models solely rely on the textual data within the queries and ads, whose performance is hindered by the scarce semantic information in these short texts. Recently, user behavior graphs have been incorporated to provide complementary information beyond pure textual semantics.Despite the promising performance, behavior-enhanced models suffer from exhausting resource costs due to the extra computations introduced by explicit topological aggregations. In this paper, we propose a novel Multi-Grained Topological Pre-Training paradigm, MGTLM, to teach language models to understand multi-grained topological information in behavior graphs, which contributes to eliminating explicit graph aggregations and avoiding information loss. Extensive experimental results over online and offline settings demonstrate the superiority of our proposal.
The purpose of audio-text retrieval is to learn a cross-modal similarity function between audio and text, enabling a given audio/text to find similar text/audio from a candidate set. Recent audio-text retrieval models aggregate multi-modal features into a single-grained representation. However, single-grained representation is difficult to solve the situation that an audio is described by multiple texts of different granularity levels, because the association pattern between audio and text is complex. Therefore, we propose an adaptive aggregation strategy to automatically find the optimal pool function to aggregate the features into a comprehensive representation, so as to learn valuable multi-grained representation. And multi-grained comparative learning is carried out in order to focus on the complex correlation between audio and text in different granularity. Meanwhile, text-guided token interaction is used to reduce the impact of redundant audio clips. We evaluated our proposed method on two audio-text retrieval benchmark datasets of Audiocaps and Clotho, achieving the state-of-the-art results in text-to-audio and audio-to-text retrieval. Our findings emphasize the importance of learning multi-modal multi-grained representation.
Since existing methods are often not effective to detect communities with multiple topics in attributed networks, we propose a method named SSAGCN via Autoencoder-style self-supervised learning. SSAGCN firstly designs an adaptive graph convolutional network (AGCN), which is treated as the encoder for fusing topology information and attribute information automatically, and then utilizes a dual decoder to simultaneously reconstruct network topology and attributes. By further introducing the modularity maximization and the joint optimization strategies, SSAGCN can detect communities with multiple topics in an end-to-end manner. Experimental results show that SSAGCN outperforms state-of-the-art approaches, and also can be used to conduct topic analysis well.
Automated essay scoring (AES) is a crucial research area with potential applications in education and beyond. However, recent studies have primarily focused on AES models that evaluate essays within a specific domain or using a holistic score, leaving a gap in research and resources for more generalized models capable of assessing essays with detailed items from multiple perspectives. As evaluating and scoring essays based on complex traits is costly and time-consuming, datasets for such AES evaluations are limited. To address these issues, we developed a cross-prompt trait scoring AES model and proposed a suitable curriculum learning (CL) design. By devising difficulty scores and introducing the key curriculum method, we demonstrated its effectiveness compared to existing CL strategies in natural language understanding tasks.
Dense retrieval has made significant advancements in information retrieval (IR) by achieving high levels of effectiveness while maintaining online efficiency during a single-pass retrieval process. However, the application of pseudo relevance feedback (PRF) to further enhance retrieval effectiveness results in a doubling of online latency. To address this challenge, this paper presents a single-pass dense retrieval framework that shifts the PRF process offline through the utilization of pre-generated pseudo-queries. As a result, online retrieval is reduced to a single matching with the pseudo-queries, hence providing faster online retrieval. The effectiveness of the proposed approach is evaluated on the standard TREC DL and HARD datasets, and the results demonstrate its promise. Our code is openly available at https://github.com/Rosenberg37/OPRF https://github.com/Rosenberg37/OPRF.
Extractive Transformer-based models for question answering (QA) are trained to predict the start and end position of the answer in a candidate paragraph. However, the true answer position can bias these models when its distribution in the training data is highly skewed. That is, models trained only with the answer at the beginning of the paragraph will perform poorly on test instances with the answer at the end. Many studies have focused on countering answer position bias but have yet to deepen our understanding of how such bias manifests in the main components of the Transformer. In this paper, we analyze the self-attention and embedding generation components of five Transformer-based models with different architectures and position embedding strategies. Our analysis shows that models tend to map position bias in their attention matrices, generating embeddings that correlate the answer and its biased position, ultimately compromising model generalization.
One advantage of neural ranking models is that they are meant to generalise well in situations of synonymity i.e. where two words have similar or identical meanings. In this paper, we investigate and quantify how well various ranking models perform in a clear-cut case of synonymity: when words are simply expressed in different surface forms due to regional differences in spelling conventions (e.g., color vs colour). We first explore the prevalence of American and British English spelling conventions in datasets used for the pre-training, training and evaluation of neural retrieval methods, and find that American spelling conventions are far more prevalent. Despite these biases in the training data, we find that retrieval models often generalise well in this case of synonymity. We explore the effect of document spelling normalisation in retrieval and observe that all models are affected by normalising the document’s spelling. While they all experience a drop in performance when normalised to a different spelling convention than that of the query, we observe varied behaviour when the document is normalised to share the query spelling convention: lexical models show improvements, dense retrievers remain unaffected, and re-rankers exhibit contradictory behaviour.
With the proliferation of algorithmic decision-making, increased scrutiny has been placed on these systems. This paper explores the relationship between the quality of the training data and the overall fairness of the models trained with such data in the context of supervised classification. We measure key fairness metrics across a range of algorithms over multiple image classification datasets that have a varying level of noise in both the labels and the training data itself. We describe noise in the labels as inaccuracies in the labelling of the data in the training set and noise in the data as distortions in the data, also in the training set. By adding noise to the original datasets, we can explore the relationship between the quality of the training data and the fairness of the output of the models trained on that data.
Dealing with unjudged documents (“holes”) in relevance assessments is a perennial problem when evaluating search systems with offline experiments. Holes can reduce the apparent effectiveness of retrieval systems during evaluation and introduce biases in models trained with incomplete data. In this work, we explore whether large language models can help us fill such holes to improve offline evaluations. We examine an extreme, albeit common, evaluation setting wherein only a single known relevant document per query is available for evaluation. We then explore various approaches for predicting the relevance of unjudged documents with respect to a query and the known relevant document, including nearest neighbor, supervised, and prompting techniques. We find that although the predictions of these One-Shot Labelers (1SL) frequently disagree with human assessments, the labels they produce yield a far more reliable ranking of systems than the single labels do alone. Specifically, the strongest approaches can consistently reach system ranking correlations of over 0.86 with the full rankings over a variety of measures. Meanwhile, the approach substantially increases the reliability of t-tests due to filling holes in relevance assessments, giving researchers more confidence in results they find to be significant. Alongside this work, we release an easy-to-use software package to enable the use of 1SL for evaluation of other ad-hoc collections or systems.
Next item recommendation is a crucial task of session-based recommendation. However, the gap between the optimization objective (Binary Cross Entropy) and the ranking metric (Mean Reciprocal Rank) has not been well-explored, resulting in sub-optimal recommendations. In this paper, we propose a novel objective function, namely Adjusted-RR, to directly optimize Mean Reciprocal Rank. Specifically, Adjusted-RR adopts Bayesian Average to adjust Reciprocal Rank loss with Normal Rank loss by creating position-aware weights between them. Adjusted-RR is a plug-and-play objective that is compatible with various models. We apply Adjusted-RR on two base models and two datasets, and experimental results show that it makes a significant improvement in the next item recommendation.
Users of search systems often reformulate their queries by adding query terms to reflect their evolving information need or to more precisely express their information need when the system fails to surface relevant content. Analyzing these query reformulations can inform us about both system and user behavior. In this work, we study a special category of query reformulations that involve specifying demographic group attributes, such as gender, as part of the reformulated query (e.g., ”olympic 2021 soccer results” -> ”olympic 2021 women’s soccer results”). There are many ways a query, the search results, and a demographic attribute such as gender may relate, leading us to hypothesize different causes for these reformulation patterns, such as under-representation on the original result page or based on the linguistic theory of markedness. This paper reports on an observational study of gender-specializing query reformulations—their contexts and effects—as a lens on the relationship between system results and gender, based on large-scale search log data from Bing. We find that these reformulations sometimes correct for and other times reinforce gender representation on the original result page, but typically yield better access to the ultimately-selected results. The prevalence of these reformulations—and which gender they skew towards—differ by topical context. However, we do not find evidence that either group under-representation or markedness alone adequately explains these reformulations. We hope that future research will use such reformulations as a probe for deeper investigation into gender (and other demographic) representation on the search result page.
With the development of online platforms, people can share and obtain opinions quickly. It also makes individuals’ preferences change dynamically and rapidly because they may change their minds when getting convincing opinions from other users. Unlike representative areas of recommendation research such as e-commerce platforms where items’ features are fixed, in investment scenarios financial instruments’ features such as stock price, also change dynamically over time. To capture these dynamic features and provide a better-personalized recommendation for amateur investors, this study proposes a Personalized Dynamic Recommender System for Investors, PDRSI. The proposed PDRSI considers two investor’s personal features: dynamic preferences and historical interests, and two temporal environmental properties: recent discussions on the social media platform and the latest market information. The experimental results support the usefulness of the proposed PDRSI, and the ablation studies show the effect of each module. For reproduction, we follow Twitter’s developer policy to share our dataset for future work.
Existing explanation models generate only text for recommendations but still struggle to produce diverse contents. In this paper, to further enrich explanations, we propose a new task named personalized showcases, in which we provide both textual and visual information to explain our recommendations. Specifically, we first select a personalized image set that is the most relevant to a user’s interest toward a recommended item. Then, natural language explanations are generated accordingly given our selected images. For this new task, we collect a large-scale dataset from Google Maps and construct a high-quality subset for generating multi-modal explanations. We propose a personalized multi-modal framework which can generate diverse and visually-aligned explanations via contrastive learning. Experiments show that our framework benefits from different modalities as inputs, and is able to produce more diverse and expressive explanations compared to previous methods on a variety of evaluation metrics.
The Transformer Memory as a Differentiable Search Index (DSI) has been proposed as a new information retrieval paradigm, which aims to address the limitations of dual-encoder retrieval framework based on the similarity score. The DSI framework outperforms strong baselines by directly generating relevant document identifiers from queries without relying on an explicit index. The memorization power of DSI framework makes it suitable for personalized retrieval tasks. Therefore, we propose a Personal Transformer Memory (PersonalTM) architecture for personalized text retrieval. PersonalTM incorporates user-specific profiles and contextual user click behaviors, and introduces hierarchical loss in the decoding process to align with the hierarchical assignment of document identifier. Additionally, PersonalTM also employs an adapter architecture to improve the scalability for index updates and reduce computation costs, compared to the vanilla DSI. Experiments show that PersonalTM outperforms the DSI baseline, BM25, fine-tuned dual-encoder, and other personalized models in terms of precision at top 1st and 10th positions and Mean Reciprocal Rank (MRR). Specifically, PersonalTM improves p@1 by 58%, 49%, and 12% compared to BM25, Dual-encoder, and DSI, respectively.
Vision-language (VL) Pre-training (VLP) has shown to well generalize VL models over a wide range of VL downstream tasks, especially for cross-modal retrieval. However, it hinges on a huge amount of image-text pairs, which requires tedious and costly curation. On the contrary,weakly-supervised VLP (W-VLP) explores means with object tags generated by a pre-trained object detector (OD) from images. Yet, they still require paired information, i.e. images and object-level annotations, as supervision to train an OD.
To further reduce the amount of supervision, we propose Prompts-in-The-Loop (PiTL) that prompts knowledge from large language models (LLMs) to describe images. Concretely, given a category label of an image, e.g.refinery, the knowledge, e.g.a refinery could be seen with large storage tanks, pipework, and …, extracted by LLMs is used as the language counterpart. The knowledge supplements, e.g. the common relations among entities most likely appearing in a scene. We create IN14K, a new VL dataset of 9M images and 1M descriptions of 14K categories from ImageNet21K with PiTL. Empirically, the VL models pre-trained with PiTL-generated pairs are strongly favored over other W-VLP works on image-to-text (I2T) and text-to-image (T2I) retrieval tasks, with less supervision. The results reveal the effectiveness of PiTL-generated pairs for VLP.
Lifelong seq2seq language generation models are trained with multiple domains in a lifelong learning manner, with data from each domain being observed in an online fashion. It is a well-known problem that lifelong learning suffers from the catastrophic forgetting (CF). To handle this challenge, existing works have leveraged experience replay or dynamic architecture to consolidate the past knowledge, which however result in incremental memory space or high computational cost. In this work, we propose a novel framework name “power norm based lifelong learning” (PNLLL), which aims to remedy the catastrophic forgetting issues with a power normalization on NLP transformer models. Specifically, PNLLL leverages power norm to achieve a better balance between past experience rehearsal and new knowledge acquisition. These designs enable the knowledge adaptation onto new tasks while memorizing the experience of past tasks. Our experiments on paraphrase generation tasks show that PNLLL not only outperforms SOTA models by a considerable margin and but also largely alleviates forgetting.
Sequential recommender models typically generate predictions in a single step during testing, without considering additional prediction correction to enhance performance as humans would. To improve the accuracy of these models, some researchers have attempted to simulate human analogical reasoning to correct predictions for testing data by drawing analogies with the prediction errors of similar training data. However, there are inherent gaps between testing and training data, which can make this approach unreliable. To address this issue, we propose an Abductive Prediction Correction (APC) framework for sequential recommendation. Our approach simulates abductive reasoning to correct predictions. Specifically, we design an abductive reasoning task that infers the most probable historical interactions from the future interactions predicted by a recommender, and minimizes the discrepancy between the inferred and true historical interactions to adjust the predictions. We perform the abductive inference and adjustment using a reversed sequential model in the forward and backward propagation manner of neural networks. Our APC framework is applicable to various differentiable sequential recommender models. We implement it on three backbone models and demonstrate its effectiveness. We release the code at https://github.com/zyang1580/APC.
In order to accurately simulate users in conversational systems, it is essential to comprehend the factors that influence their behaviour. This is a critical challenge for the Information Retrieval (IR) field, as conventional methods are not well-suited for the interactive and unique sequential structure of conversational contexts. In this study, we employed the concept of Priming effects from the Psychology literature to identify core stimuli for each abstracted effect. We then examined these stimuli on various datasets to investigate their correlations with users’ actions. Finally, we trained Logistic Regression (LR) models based on these stimuli to anticipate users’ actions. Our findings offer a basis for creating more realistic user models and simulators, as we identified the subset of stimuli with strong relationships with users’ actions. Additionally, we built a model that can predict users’ actions.
Meeting summarization has an enormous business potential, but in addition to being a hard problem, roll-out is challenged by privacy concerns. We explore the problem of meeting summarization under differential privacy constraints and find, to our surprise, that while differential privacy leads to slightly lower performance on in-sample data, differential privacy improves performance when evaluated on unseen meeting types. Since meeting summarization systems will encounter a great variety of meeting types in practical employment scenarios, this observation makes safe meeting summarization seem much more feasible. We perform extensive error analysis and identify potential risks in meeting summarization under differential privacy, including a faithfulness analysis.
Prompt Learning to Mitigate Catastrophic Forgetting in Cross-lingual Transfer for Open-domain Dialogue Generation
Dialogue systems for non-English languages have long been under-explored. In this paper, we take the first step to investigate few-shot cross-lingual transfer learning (FS-XLT) and multitask learning (MTL) in the context of open-domain dialogue generation for non-English languages with limited data. We observed catastrophic forgetting in both FS-XLT and MTL for all 6 languages in our preliminary experiments. To mitigate the issue, we propose a simple yet effective prompt learning approach that can preserve the multilinguality of multilingual pre-trained language model (mPLM) in FS-XLT and MTL by bridging the gap between pre-training and fine-tuning with Fixed-prompt LM Tuning and our hand-crafted prompts. Experimental results on all 6 languages in terms of both automatic and human evaluations demonstrate the effectiveness of our approach. Our code is available at https://github.com/JeremyLeiLiu/XLinguDial.
Predicting churn and designing intervention strategies are crucial for online platforms to maintain user engagement. We hypothesize that predicting churn, i.e. users leaving from the system without further return, is often a delayed act, and it might get too late for the system to intervene. We propose detecting early signs of users losing interest, allowing time for intervention, and introduce a new formulation ofuser fatigue as short-term dissatisfaction, providing early signals to predict long-term churn. We identify behavioral signals predicting fatigue and develop models for fatigue prediction. Furthermore, we leverage the predicted fatigue estimates to develop fatigue-aware ad-load balancing intervention strategy that reduces churn, improving short- and long-term user retention. Results from deployed recommendation system and multiple live A/B tests across over 80 million users generating over 200 million sessions highlight gains for user engagement and platform strategic metrics.
The information retrieval community has observed significant performance improvements over various tasks due to the introduction of neural architectures. However, such improvements do not necessarily seem to have happened uniformly across a range of queries. As we will empirically show in this paper, the performance of neural rankers follow a long-tail distribution where there are many subsets of queries, which are not effectively satisfied by neural methods. Despite this observation, performance is often reported using standard retrieval metrics, such as MRR or nDCG, which capture average performance over all queries. As such, it is not clear whether reported improvements are due to incremental boost on a small subset of already well-performing queries or addressing queries that have been difficult to address by existing methods. In this paper, we propose the Task Subspace Coverage (TaSC /tAHsk/) metric, which systematically quantifies whether and to what extent improvements in retrieval effectiveness happen on similar or disparate query subspaces for different rankers. Our experiments show that the consideration of our proposed TaSC metric in conjunction with existing ranking metrics provides deeper insight into ranker performance and their contribution to overall advances on a given task.
Due to the massive size of test collections, a standard practice in IR evaluation is to construct a ‘pool’ of candidate relevant documents comprised of the top-k documents retrieved by a wide range of different retrieval systems – a process called depth-k pooling. A standard practice is to set the depth (k) to a constant value for each query constituting the benchmark set. However, in this paper we argue that the annotation effort can be substantially reduced if the depth of the pool is made a variable quantity for each query, the rationale being that the number of documents relevant to the information need can widely vary across queries. Our hypothesis is that a lower depth for queries with a small number of relevant documents, and a higher depth for those with a larger number of relevant documents can potentially reduce the annotation effort without a significant change in IR effectiveness evaluation.We make use of standard query performance prediction (QPP) techniques to estimate the number of potentially relevant documents for each query, which is then used to determine the depth of the pool. Our experiments conducted on standard test collections demonstrate that this proposed method of employing query-specific variable depths is able to adequately reflect the relative effectiveness of IR systems with a substantially smaller annotation effort.
Pretrained language models such as BERT have been shown to be exceptionally effective for text ranking. However, there are limited studies on how to leverage more powerful sequence-to-sequence models such as T5. Existing attempts usually formulate text ranking as a classification problem and rely on postprocessing to obtain a ranked list. In this paper, we propose RankT5 and study two T5-based ranking model structures, an encoder-decoder and an encoder-only one, so that they not only can directly output ranking scores for each query-document pair, but also can be fine-tuned with pairwise or listwise ranking losses to optimize ranking performance. Our experiments show that the proposed models with ranking losses can achieve substantial ranking performance gains on different public text ranking data sets. Moreover, ranking models fine-tuned with listwise ranking losses have better zero-shot ranking performance on out-of-domain data than models fine-tuned with classification losses.
Rating Prediction in Conversational Task Assistants with Behavioral and Conversational-Flow Features
Predicting the success of Conversational Task Assistants (CTA) can be critical to understand user behavior and act accordingly. In this paper, we propose TB-Rater, a Transformer model which combines conversational-flow features with user behavior features for predicting user ratings in a CTA scenario. In particular, we use real human-agent conversations and ratings collected in the Alexa TaskBot challenge, a novel multimodal and multi-turn conversational context. Our results show the advantages of modeling both the conversational-flow and behavioral aspects of the conversation in a single model for offline rating prediction. Additionally, an analysis of the CTA-specific behavioral features brings insights into this setting and can be used to bootstrap future systems.
Real-world fact verification task aims to verify the factuality of a claim by retrieving evidence from the source document. The quality of the retrieved evidence plays an important role in claim verification. Ideally, the retrieved evidence should be faithful (reflecting the model’s decision-making process in claim verification) and plausible (convincing to humans), and can improve the accuracy of verification task. Although existing approaches leverage the similarity measure of semantic or surface form between claims and documents to retrieve evidence, they all rely on certain heuristics that prevent them from satisfying all three requirements. In light of this, we propose a fact verification model named ReRead to retrieve evidence and verify claim that: (1) Train the evidence retriever to obtain interpretable evidence (i.e., faithfulness and plausibility criteria); (2) Train the claim verifier to revisit the evidence retrieved by the optimized evidence retriever to improve the accuracy. The proposed system is able to achieve significant improvements upon best-reported models under different settings.
Reducing Spurious Correlations for Relation Extraction by Feature Decomposition and Semantic Augmentation
Deep neural models have become mainstream in relation extraction (RE), yielding state-of-the-art performance. However, most existing neural models are prone to spurious correlations between input features and prediction labels, making the models suffer from low robustness and generalization.In this paper, we propose a spurious correlation reduction method for RE via feature decomposition and semantic augmentation (denoted as FDSA). First, we decompose the original sentence representation into class-related features and context-related features. To obtain better context-related features, we devise a contrastive learning method to pull together the context-related features of the anchor sentence and its augmented sentences, and push away the context-related features of different anchor sentences. In addition, we propose gradient-based semantic augmentation on context-related features in order to improve the robustness of the RE model. Experiments on four datasets show that our model outperforms the strong competitors.
Learned sparse document representations using a transformer-based neural model has been found to be attractive in both relevance effectiveness and time efficiency. This paper describes a representation sparsification scheme based on hard and soft thresholding with an inverted index approximation for faster SPLADE-based document retrieval. It provides analytical and experimental results on the impact of this learnable hybrid thresholding scheme.
The task of knowledge graph completion (KGC) is of great importance. To achieve scalability when dealing with large-scale knowledge graphs, recent works formulate KGC as a sequence-to-sequence process, where the incomplete triplet (input) and the missing entity (output) are both verbalized as text sequences. However, inference with these methods relies solely on the model parameters for implicit reasoning and neglects the use of KG itself, which limits the performance since the model lacks the capacity to memorize a vast number of triplets. To tackle this issue, we introduce ReSKGC, a Retrieval-enhanced Seq2seq KGC model, which selects semantically relevant triplets from the KG and uses them as evidence to guide output generation with explicit reasoning. Our method has demonstrated state-of-the-art performance on benchmark datasets Wikidata5M and WikiKG90Mv2, which contain about 5M and 90M entities, respectively.
Real recommendation systems contain various features, which are often high-dimensional, sparse, and difficult to learn effectively. In addition to numerical features, user reviews contain rich semantic information including user preferences, which are used as auxiliary features by researchers. The methods of supplementing data features based on reviews have certain effects. However, most of them simply concatenate review representations and other features together, without considering that the text representation contains a lot of noise information. In addition, the important intentions contained in user reviews are not modeled effectively. In order to solve the above problems, we propose a novel Review-based Multi-intention Contrastive Learning (RMCL) method. In detail, RMCL proposes an intention representation method based on mixed Gaussian distribution hypothesis. Further, RMCL adopts a multi-intention contrastive strategy, which establishes a fine-grained connection between user reviews and item reviews. Extensive experiments on five real-world datasets demonstrate significant improvements of our proposed RMCL model over the state-of-the-art methods.
Given a textual sentence provided by a user, the Temporal Language Grounding (TLG) task is defined as the process of finding a semantically relevant video moment or clip from an untrimmed video. In recent years, localization-based TLG methods have been explored, which adopt reinforcement learning to locate a clip from the video. However, these methods are not stable enough due to the stochastic exploration mechanism of reinforcement learning, which is sensitive to the reward. Therefore, providing a more flexible and reasonable reward has become a focus of attention for both academia and industry.
Inspired by the training process of chatGPT, we innovatively adopt a vision-language pre-training (VLP) model as a reward model, which provides flexible rewards to help the localization-based TLG task converge. Specifically, a reinforcement learning-based localization module is introduced to predict the start and end timestamps in multi-modal scenarios. Thereafter, we fine-tune a reward model based on a VLP model, even introducing some human feedback, which provides a flexible reward score for the localization module. In this way, our model is able to capture subtle differences of the untrimmed video. Extensive experiments on two datasets have well verified the effectiveness of our proposed solution.
Recently, there has been growing interest in integrating causal inference into recommender systems to answer the hypothetical question: “what would be the potential feedback when a user is recommended a product?” Various unbiased estimators, including Inverse Propensity Score (IPS) and Doubly Robust (DR), have been proposed to address this question. However, these estimators often assume that confounders are precisely observable, which is not always the case in real-world scenarios. To address this challenge, we propose a novel method called Adversarial Training-based IPS (AT-IPS), which uses adversarial training to handle noisy confounders. The proposed method defines a feasible region for the confounders, obtains the worst-case noise (adversarial noise) within the region, and jointly trains the propensity model and the prediction model against such noise to improve their robustness. We provide a theoretical analysis of the accuracy-robustness tradeoff of AT-IPS and demonstrate its superior performance compared to other popular estimators on both real-world and semi-synthetic datasets.
Going beyond accuracy in the evaluation of a recommender system is an aspect that is receiving more and more attention. Among the many perspectives that can be considered, the impact of presentation bias is of central importance. Under presentation bias, the attention of the users to the items in a recommendation list changes, thus affecting their possibility to be considered and the effectiveness of a model. Page-wise within-subject studies are widely employed in the recommender systems literature to compare algorithms by displaying their results in parallel. However, no study has ever been performed to assess the impact of presentation bias in this context. In this paper, we characterize how presentation bias affects different layout options, which present the results in column- or row-wise fashion. Concretely, we present a user study where six layout variants are proposed to the users in a page-wise within-subject setting, so as to evaluate their perception of the displayed recommendations. Results show that presentation bias impacts users clicking behavior (low-level feedback), but not so much the perceived performance of a recommender system (high-level feedback). Source codes and raw results are available at https://tinyurl.com/PresBiasSIGIR2023.
Searching for Products in Virtual Reality: Understanding the Impact of Context and Result Presentation on User Experience
Immersive technologies such as virtual reality (VR) and head-mounted displays (HMD) have seen increased adoption in recent years. In this work, we study two factors that influence users’ experience when shopping in VR through voice queries: (1) context alignment of the search environment and (2) the level of detail on the Search Engine Results Page (SERP). To this end, we developed a search system for VR and conducted a within-subject exploratory study (N=18) to understand the impact of the two experimental conditions. Our results suggest that both context alignment and SERP are important factors for information-seeking in VR, which present unique opportunities and challenges. More specifically, based on our findings, we suggest that search systems for VR must be able to: (1) provide cues for information-seeking in both the VR environment and SERP, (2) distribute attention between the VR environment and the search interface, (3) reduce distractions in the VR environment and (4) provide a ”sense of control” to search in the VR environment.
Low-resource relation extraction (LRE) aims to extract potential relations from limited labeled corpus to handle the problem of scarcity of human annotations. Previous works mainly consist of two categories of methods: (1) Self-training methods, which improve themselves through the models’ predictions, thus suffering from confirmation bias when the predictions are wrong. (2) Self-ensembling methods, which learn task-agnostic representations, therefore, generally do not work well for specific tasks. In our work, we propose a novel LRE architecture named SelfLRE, which leverages two complementary modules, one module uses self-training to obtain pseudo-labels for unlabeled data, and the other module uses self-ensembling learning to obtain the task-agnostic representations, and leverages the existing pseudo-labels to refine the better task-specific representations on unlabeled data. The two models are jointly trained through multi-task learning to iteratively improve the effect of LRE task. Experiments on three public datasets show that SelfLRE achieves 1.81% performance gain over the SOTA baseline. Source code is available at: https://github.com/THU-BPM/SelfLRE.
Graph Neural Networks (GNNs) have achieved impressive performance in collaborative filtering. However, recent studies show that GNNs tend to yield inferior performance when the distributions of training and test data are not aligned well. Moreover, training GNNs often requires optimizing non-convex neural networks with an abundance of local and global minima, which may differ widely in their performance at test time. Thus, it is essential to develop an optimization strategy that can choose the minima carefully, which can yield strong generalization performance on unseen data. Here we propose an effective training schema, called gSAM, under the principle that theflatter minima has a better generalization ability than thesharper ones. To achieve this goal, gSAM regularizes the flatness of the weight loss landscape by forming a bi-level optimization: the outer problem conducts the standard model training while the inner problem helps the model jump out of the sharp minima. Experimental results show the superiority of our gSAM.
Simple Approach for Aspect Sentiment Triplet Extraction Using Span-Based Segment Tagging and Dual Extractors
Aspect sentiment triplet extraction (ASTE) is a task which extracts aspect terms, opinion terms, and sentiment polarities as triplets from review sentences. Existing approaches have developed bidirectional structures for term interaction. Sentiment polarities are subsequently extracted from aspect-opinion pairs. These solutions suffer from: 1) high dependency on custom bidirectional structures, 2) inadequate representation of the information through existing tagging schemes, and 3) insufficient usage of all available sentiment data. To address the above issues, we propose a simple span-based solution named SimSTAR with Segment Tagging And dual extRactors. SimSTAR does not introduce any additional bidirectional mechanism. The segment tagging scheme is capable to indicate all possible cases of spans and reveals more information through negative labels. Dual extractors are employed to make the sentiment extraction independent of the term extraction. We evaluate our model on four ASTE datasets. The experimental results show that our simple method achieves state-of-the-art performance.
The problem of inner product search (IPS) is important in many fields. Although maximum inner product search (MIPS) is often considered, its result is usually skewed and static. Users are hence hard to obtain diverse and/or new items by using the MIPS problem. Motivated by this, we formulate a new problem, namely the fair and independent IPS problem. Given a query, a threshold, and an output size k, this problem randomly samples k items from a set of items such that the inner product of the query and item is not less than the threshold. For each item that satisfies the threshold, this problem is fair, because the probability that such an item is outputted is equal to that for each other item. This fairness can yield diversity and novelty, but this problem faces a computational challenge. Some existing (M)IPS techniques can be employed in this problem, but they require O(n) or o(n) time, where n is the dataset size. To scale well to large datasets, we propose a simple yet efficient algorithm that runs in O(log n + k) expected time. We conduct experiments using real datasets, and the results demonstrate that our algorithm is up to 330 times faster than baselines.
The advent of personalized news recommendation has given rise to increasingly complex recommender architectures. Most neural news recommenders rely on user click behavior and typically introduce dedicated user encoders that aggregate the content of clicked news into user embeddings (early fusion). These models are predominantly trained with standard point-wise classification objectives. The existing body of work exhibits two main shortcomings: (1) despite general design homogeneity, direct comparisons between models are hindered by varying evaluation datasets and protocols; (2) it leaves alternative model designs and training objectives vastly unexplored. In this work, we present a unified framework for news recommendation, allowing for a systematic and fair comparison of news recommenders across several crucial design dimensions: (i) candidate-awareness in user modeling, (ii) click behavior fusion, and (iii) training objectives. Our findings challenge the status quo in neural news recommendation. We show that replacing sizable user encoders with parameter-efficient dot products between candidate and clicked news embeddings (late fusion) often yields substantial performance gains. Moreover, our results render contrastive training a viable alternative to point-wise classification objectives.
In this paper we introduce SimTDE, a simple knowledge distillation framework to compress sentence embeddings transformer models with minimal performance loss and significant size and latency reduction. SimTDE effectively distills large and small transformers via a compact token embedding block and a shallow encoding block, connected with a projection layer, relaxing dimension match requirement. SimTDE simplifies distillation loss to focus only on token embedding and sentence embedding. We evaluate on standard semantic textual similarity (STS) tasks and entity resolution (ER) tasks. It achieves 99.94% of the state-of-the-art (SOTA) SimCSE-Bert-Base performance with 3 times size reduction and 96.99% SOTA performance with 12 times size reduction on STS tasks. It also achieves 99.57% of teacher’s performance on multi-lingual ER data with a tiny transformer student model of 1.4M parameters and 5.7MB size. Moreover, compared to other distilled transformers SimTDE is 2 times faster at inference given similar size and still 1.17 times faster than a model 33% smaller (e.g. MiniLM). The easy-to-adopt framework, strong accuracy and low latency of SimTDE can widely enable runtime deployment of SOTA sentence embeddings.
A recent trend in multimodal retrieval is related to postprocessing test set results via the dual-softmax loss (DSL). While this approach can bring significant improvements, it usually presumes that an entire matrix of test samples is available as DSL input. This work introduces a new postprocessing approach based on Sinkhorn transformations that outperforms DSL. Further, we propose a new postprocessing setting that does not require access to multiple test queries. We show that our approach can significantly improve the results of state of the art models such as CLIP4Clip, BLIP, X-CLIP, and DRL, thus achieving a new state-of-the-art on several standard text-video retrieval datasets both with access to the entire test set and in the single-query setting.
In dense retrieval, prior work has largely improved retrieval effectiveness using multi-vector dense representations, exemplified by ColBERT. In sparse retrieval, more recent work, such as SPLADE, demonstrated that one can also learn sparse lexical representations to achieve comparable effectiveness while enjoying better interpretability. In this work, we combine the strengths of both the sparse and dense representations for first-stage retrieval. Specifically, we propose SparseEmbed – a novel retrieval model that learns sparse lexical representations with contextual embeddings. Compared with SPLADE, our model leverages the contextual embeddings to improve model expressiveness. Compared with ColBERT, our sparse representations are trained end-to-end to optimize both efficiency and effectiveness.
Work in information retrieval has largely been centered around ranking and relevance: given a query, return some number of results ordered by relevance to the user. The problem of result list truncation, or where to truncate the ranked list of results, however, has received less attention despite being crucial in a variety of applications. Such truncation is a balancing act between the overall relevance, or usefulness of the results, with the user cost of processing more results. Result list truncation can be challenging because relevance scores are often not well-calibrated. This is particularly true in large-scale IR systems where documents and queries are embedded in the same metric space and a query’s nearest document neighbors are returned during inference. Here, relevance is inversely proportional to the distance between the query and candidate document, but what distance constitutes relevance varies from query to query and changes dynamically as more documents are added to the index. In this work, we propose Surprise scoring, a statistical method that leverages the Generalized Pareto Distribution that arises in extreme value theory to produce interpretable and calibrated relevance scores at query time using nothing more than the ranked scores. We demonstrate its effectiveness on the result list truncation task across image, text, and IR datasets and compare it to both classical and recent baselines. We draw connections to hypothesis testing and p-values.
Recent work has shown that incorporating explanations into the output generated by large language models (LLMs) can significantly enhance performance on a broad spectrum of reasoning tasks. Our study extends these findings by demonstrating the benefits of explanations for neural rankers. By utilizing LLMs such as GPT-3.5 to enrich retrieval datasets with explanations, we trained a sequence-to-sequence ranking model, dubbed ExaRanker, to generate relevance labels and explanations for query-document pairs. The ExaRanker model, finetuned on a limited number of examples and synthetic explanations, exhibits performance comparable to models finetuned on three times more examples, but without explanations. Moreover, incorporating explanations imposes no additional computational overhead into the reranking step and allows for on-demand explanation generation. The codebase and datasets used in this study will be available at https://github.com/unicamp-dl/ExaRanker
Meta-learning has become a widely used method for the user cold-start problem in recommendation systems, as it allows the model to learn from similar learning tasks and transfer the knowledge to new tasks. However, most existing meta-learning methods do not consider the temporal factor of users’ preferences, which is crucial for news recommendation scenarios where news streams change dynamically over time. In this paper, we propose Time-Aware Meta-Learning (TAML), a novel framework that focuses on cold-start users in news recommendation systems. TAML factorizes user preferences into time-specifc and time-shift representations that jointly affect users’ news preferences. These temporal factors are further incorporated into the meta-learning framework to achieve accurate and timely cold-start recommendations. Extensive experiments are conducted on two real-world datasets, demonstrating the superior performance of TAML over state-of-the-art methods.
Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available here: https://github.com/mesnico/text-to-motion-retrieval.
Deep learning-based recommender systems have become an integral part of several online platforms. However, their black-box nature emphasizes the need for explainable artificial intelligence (XAI) approaches to provide human-understandable reasons why a specific item gets recommended to a given user. One such method is counterfactual explanation (CF). While CFs can be highly beneficial for users and system designers, malicious actors may also exploit these explanations to undermine the system’s security.
In this work, we propose H-CARS, a novel strategy to poison recommender systems via CFs. Specifically, we first train a logical-reasoning-based surrogate model on training data derived from counterfactual explanations. By reversing the learning process of the recommendation model, we thus develop a proficient greedy algorithm to generate fabricated user profiles and their associated interaction records for the aforementioned surrogate model. Our experiments, which employ a well-known CF generation method and are conducted on two distinct datasets, show that H-CARS yields significant and successful attack performance.
The MS MARCO-passage dataset has been the main large-scale dataset open to the IR community and it has fostered successfully the development of novel neural retrieval models over the years. But, it turns out that two different corpora of MS MARCO are used in the literature, the official one and a second one where passages were augmented with titles, mostly due to the introduction of the Tevatron code base. However, the addition of titles actually leaks relevance information, while breaking the original guidelines of the MS MARCO-passage dataset. In this work, we investigate the differences between the two corpora and demonstrate empirically that they make a significant difference when evaluating a new method. In other words, we show that if a paper does not properly report which version is used, reproducing fairly its results is basically impossible. Furthermore, given the current status of reviewing, where monitoring state-of-the-art results is of great importance, having two different versions of a dataset is a large problem. This is why this paper aims to report the importance of this issue so that researchers can be made aware of this problem and appropriately report their results.
Relation extraction (RE) aims to extract potential relations according to the context of two entities, thus, deriving rational contexts from sentences plays an important role. Previous works either focus on how to leverage the entity information (e.g., entity types, entity verbalization) to inference relations, but ignore context-focused content, or use counterfactual thinking to remove the model’s bias of potential relations in entities, but the relation reasoning process will still be hindered by irrelevant content. Therefore, how to preserve relevant content and remove noisy segments from sentences is a crucial task. In addition, retained content needs to be fluent enough to maintain semantic coherence and interpretability. In this work, we propose a novel rationale extraction framework named RE2, which leverages two continuity and sparsity factors to obtain relevant and coherent rationales from sentences. To solve the problem that the gold rationales are not labeled, RE2 applies an optimizable binary mask to each token in the sentence, and adjust the rationales that need to be selected according to the relation label. Experiments on four datasets show that RE2 surpasses baselines.
Knowledge tracing (KT) is the problem of predicting students’ future performance based on their historical interaction sequences. With the advanced capability of capturing contextual long-term dependency, attention mechanism becomes one of the essential components in many deep learning based KT (DLKT) models. In spite of the impressive performance achieved by these attentional DLKT models, many of them are often vulnerable to run the risk of overfitting, especially on small-scale educational datasets. Therefore, in this paper, we propose sparseKT, a simple yet effective framework to improve the robustness and generalization of the attention based DLKT approaches. Specifically, we incorporate a k-selection module to only pick items with the highest attention scores. We propose two sparsification heuristics: (1) soft-thresholding sparse attention and (2) top-K sparse attention. We show that our sparseKT is able to help attentional KT models get rid of irrelevant student interactions and improve the predictive performance when compared to 11 state-of-the-art KT models on three publicly available real-world educational datasets. To encourage reproducible research, we make our data and code publicly available at https://github.com/pykt-team/pykt-toolkit1..
Nowadays safety has become one of the most critical factors for ride-hailing service. Ride-hailing platforms have conducted meticulous background checks for drivers to minimize the risk of abnormal trips, e.g. violence and sexual assault. However, current methods are labor-consuming and highly rely on the personal information of drivers, which may harm the fairness of the order dispatching system. In this paper, we utilize the trip trajectories as inputs and propose a dual variational auto-encoder(VAE) framework, namely TripSafe, to estimate the probability of abnormal safety incidents. Specifically, TripSafe models the moving behavior and route information, as two independent components and employs VAEs to pre-train generative models for normal trips. Then, a fusion network is adopted to fine-tune the whole model with a few labeled samples. In practice, TripSafe monitors the data update and calculate the anomaly score of partial-observed trips in real-time. Experiments on real ridehailing data show that TripSafe is superior to the state-of-the-art baselines with about 14.2%~28.9% improvements on F1 score.
The problem of signed network embedding (SNE) aims to represent nodes in a given signed network as low-dimensional vectors. While several SNE methods based on graph convolutional networks (GCN) have been proposed, we point out that they significantly rely on the assumption that the decades-old balance theory always holds in the real world. To address this limitation, we propose a novel GCN-based SNE approach, named as TrustSGCN, which measures the trustworthiness on edge signs for high-order relationships inferred by balance theory and corrects incorrect embedding propagation based on the trustworthiness. The experiments on four real-world signed network datasets demonstrate that TrustSGCN consistently outperforms five state-of-the-art GCN-based SNE methods. The code is available at https://github.com/kmj0792/TrustSGCN.
uCTRL: Unbiased Contrastive Representation Learning via Alignment and Uniformity for Collaborative Filtering
Because implicit user feedback for the collaborative filtering (CF) models is biased toward popular items, CF models tend to yield recommendation lists with popularity bias. Previous studies have utilized inverse propensity weighting (IPW) or causal inference to mitigate this problem. However, they solely employ pointwise or pairwise loss functions and neglect to adopt a contrastive loss function for learning meaningful user and item representations. In this paper, we propose Unbiased ConTrastive Representation Learning (uCTRL), optimizing alignment and uniformity functions derived from the InfoNCE loss function for CF models. Specifically, we formulate an unbiased alignment function used in uCTRL. We also devise a novel IPW estimation method that removes the bias of both users and items. Despite its simplicity, uCTRL equipped with existing CF models consistently outperforms state-of-the-art unbiased recommender models, up to 12.22% for Recall@20 and 16.33% for NDCG@20 gains, on four benchmark datasets.
Unbiased Pairwise Learning from Implicit Feedback for Recommender Systems without Biased Variance Control
Generally speaking, the model training for recommender systems can be based on two types of data, namely explicit feedback and implicit feedback. Moreover, because of its general availability, we see wide adoption of implicit feedback data, such as click signal. There are mainly two challenges for the application of implicit feedback. First, implicit data just includes positive feedback. Therefore, we are not sure whether the non-interacted items are really negative or positive but not displayed to the corresponding user. Moreover, the relevance of rare items is usually underestimated since much fewer positive feedback of rare items is collected compared with popular ones. To tackle such difficulties, both pointwise and pairwise solutions are proposed before for unbiased relevance learning. As pairwise learning suits well for the ranking tasks, the previously proposed unbiased pairwise learning algorithm already achieves state-of-the-art performance. Nonetheless, the existing unbiased pairwise learning method suffers from high variance. To get satisfactory performance, non-negative estimator is utilized for practical variance control but introduces additional bias. In this work, we propose an unbiased pairwise learning method, named UPL, with much lower variance to learn a truly unbiased recommender model. Extensive offline experiments on real world datasets and online A/B testing demonstrate the superior performance of our proposed method.
Graph Neural Network (GNN)-based models have become the mainstream approach for recommender systems. Despite the effectiveness, they are still suffering from the cold-start problem, i.e., recommend for few-interaction items. Existing GNN-based recommendation models to address the cold-start problem mainly focus on utilizing auxiliary features of users and items, leaving the user-item interactions under-utilized. However, embeddings distributions of cold and warm items are still largely different, since cold items’ embeddings are learned from lower-popularity interactions, while warm items’ embeddings are from higher-popularity interactions. Thus, there is a seesaw phenomenon, where the recommendation performance for the cold and warm items cannot be improved simultaneously. To this end, we proposed a Uncertainty-aware Consistency learning framework for Cold-start item recommendation (shorten as UCC) solely based on user-item interactions. Under this framework, we train the teacher model (generator) and student model (recommender) with consistency learning, to ensure the cold items with additionally generated low-uncertainty interactions can have similar distribution with the warm items. Therefore, the proposed framework improves the recommendation of cold and warm items at the same time, without hurting any one of them. Extensive experiments on benchmark datasets demonstrate that our proposed method significantly outperforms state-of-the-art methods on both warm and cold items, with an average performance improvement of 27.6%.
In industrial recommendation systems, both data sizes and computational resources vary across different scenarios. For scenarios with limited data, data sparsity can lead to a decrease in model performance. Heterogeneous knowledge distillation-based transfer learning can be used to transfer knowledge from models in data-rich domains. However, in recommendation systems, the target domain possesses specific privileged features that significantly contribute to the model. While existing knowledge distillation methods have not taken these features into consideration, leading to suboptimal transfer weights. To overcome this limitation, we propose a novel algorithm called Uncertainty-based Heterogeneous Privileged Knowledge Distillation (UHPKD). Our method aims to quantify the knowledge of both the source and target domains, which represents the uncertainty of the models. This approach allows us to derive transfer weights based on the knowledge gain, which captures the difference in knowledge between the source and target domains. Experiments conducted on both public and industrial datasets demonstrate the superiority of our UHPKD algorithm compared to other state-of-the-art methods.
In this work, we present an unsupervised retrieval method with contrastive learning on web anchors. The anchor text describes the content that is referenced from the linked page. This shows similarities to search queries that aim to retrieve pertinent information from relevant documents. Based on their commonalities, we train an unsupervised dense retriever, Anchor-DR, with a contrastive learning task that matches the anchor text and the linked document. To filter out uninformative anchors (such as “homepage” or other functional anchors), we present a novel filtering technique to only select anchors that contain similar types of information as search queries. Experiments show that Anchor-DR outperforms state-of-the-art methods on unsupervised dense retrieval by a large margin (e.g., by 5.3% NDCG@10 on MSMARCO). The gain of our method is especially significant for search and question answering tasks. Our analysis further reveals that the pattern of anchor-document pairs is similar to that of search query-document pairs. Code available at https://github.com/Veronicium/AnchorDR.
Dialogue Topic Segmentation (DTS) plays an essential role in a variety of dialogue modeling tasks. Previous DTS methods either focus on semantic similarity or dialogue coherence to assess topic similarity for unsupervised dialogue segmentation. However, the topic similarity cannot be fully identified via semantic similarity or dialogue coherence. In addition, the unlabeled dialogue data, which contains useful clues of utterance relationships, remains underexploited. In this paper, we propose a novel unsupervised DTS framework, which learns topic-aware utterance representations from unlabeled dialogue data through neighboring utterance matching and pseudo-segmentation. Extensive experiments on two benchmark datasets (i.e., DialSeg711 and Doc2Dial) demonstrate that our method significantly outperforms the strong baseline methods. For reproducibility, we provide our code and data at: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/dial-start.
A query performance prediction (QPP) method predicts the effectiveness of an IR system for a given query. While unsupervised approaches have been shown to work well for statistical IR models, it is likely that these approaches would yield limited effectiveness for neural ranking models (NRMs) because the retrieval scores of these models lie within a short range unlike their statistical counterparts. In this work, we propose to leverage a pairwise inference-based NRM’s (specifically, DuoT5) output to accumulate evidences on the pairwise believes of one document ranked above the other. We hypothesize that the more consistent these pairwise likelihoods are, the higher is the likelihood of the retrieval to be of better quality, thus yielding a higher QPP score. We conduct our experiments on the TREC-DL dataset leveraging pairwise likelihoods from an auxiliary model DuoT5. Our experiments demonstrate that the proposed method called Pairwise Rank Preference-based QPP (QPP-PRP) leads to significantly better results than a number of standard unsupervised QPP baselines on several NRMs.
In recommender systems (RSs), inverse propensity score (IPS) has been a key technique to mitigate popularity bias by decreasing the contribution of popular items in modeling user-item interactions. However, conventional IPS treats all users equally, which tends to over-debias the popularity-insensitive (PI) users and under-debias the popularity-sensitive (PS) users. Furthermore, in such a treatment, IPS only performs slightly well on the debiased test while does not work on the normal biased test. To this end, we propose a user-dependent IPS (UDIPS in short) method, which adaptively conducts propensity estimation for each user-item pair based on the user’s sensitivity to item popularity. Like IPS, our theoretical analysis validates the unbiasedness of UDIPS. Remarkably, our solution is model-agnostic and can be easily used to upgrade current unbiased recommenders. We implemented it in four state-of-the-art models for unbiased recommendation, and experimental results on two benchmark datasets demonstrate the effectiveness of our method in both unbiased and normal biased test.
In recent years, pairwise methods, such as Bayesian Personalized Ranking (BPR), have gained significant attention in the field of collaborative filtering for recommendation systems. Group BPR is an extension of BPR that incorporates user groups to relax the strict assumption of independence between two users. However, the reliability of its user groups may be compromised as they only focus on a few behavioral similarities. To address this problem, this paper proposes a new entropy-weighted similarity measure for implicit feedback to quantify the relation between two users and sample like-minded user groups. We first introduce the group preference into several pairwise ranking algorithms and then utilize the entropy-weighted similarity to sample groups to further improve these algorithms. Unlike other approaches that rely solely on common item ratings, our method incorporates global information into the similarity measure, resulting in a more reliable approach to group sampling. We conducted experiments on two real-world datasets and evaluated our method using different metrics. The results show that our method can construct better user groups from sparse data and produce more accurate recommendations. Our approach can be applied to a wide range of recommendation systems, and this can significantly improve the performance of pairwise ranking algorithms, making it an effective tool for pairwise ranking.
Scientific document classification is a critical task for a wide range of applications, but the cost of collecting human-labeled data can be prohibitive. We study scientific document classification using label names only. In scientific domains, label names often include domain-specific concepts that may not appear in the document corpus, making it difficult to match labels and documents precisely. To tackle this issue, we propose WanDeR, which leverages dense retrieval to perform matching in the embedding space to capture the semantics of label names. We further design the label name expansion module to enrich its representations. Lastly, a self-training step is used to refine the predictions. The experiments on three datasets show that WanDeR outperforms the best baseline by 11.9%. Our code will be published at https://github.com/ritaranx/wander.
We present a study of Tip-of-the-tongue (ToT) retrieval for music, where a searcher is trying to find an existing music entity, but is unable to succeed as they cannot accurately recall important identifying information. ToT information needs are characterized by complexity, verbosity, uncertainty, and possible false memories. We make four contributions. (1) We collect a dataset – TOTMUSIC–of 2,278 information needs and ground truth answers. (2) We introduce a schema for these information needs and show that they often involve multiple modalities encompassing several Music IR sub-tasks such as lyric search, audio-based search, audio fingerprinting, and text search. (3) We underscore the difficulty of this task by benchmarking a standard text retrieval approach on this dataset. (4) We investigate the efficacy of query reformulations generated by a Large Language Model (LLM), and show that they are not as effective as simply employing the entire information need as a query–leaving several open questions for future research.
As social networks become further entrenched in modern society, it becomes increasingly important to understand and predict how information (e.g., news coverage of a given event) is propagated across social media (i.e., information pathway), which helps the understandings of the impact of real-world information. Thus, in this paper, we propose a novel task, Information Pathway Prediction (IPP), which depicts the propagation paths of a given passage as a community tree (rooted at the information source) on constructed community interaction graphs where we first aggregate individual users into communities formed around news sources and influential users, and then elucidate the patterns of information dissemination across media based on such community nodes. We argue that this is an important and useful task because, on one hand, community-level interactions offer more stability than those at the user level; on the other hand, individual users are often influenced by their community, and modeling community-level information propagation will help the traditional link-prediction problem. To tackle the IPP task, we introduce Lightning, a novel content-aware link prediction GNN model and demonstrate using a large Twitter dataset consisting of all COVID related tweets that Lightning outperforms state-of-the-art link prediction baselines by a significant margin.
Which Matters Most in Making Fund Investment Decisions? A Multi-granularity Graph Disentangled Learning Framework
In this paper, we highlight that both conformity and risk preference matter in making fund investment decisions beyond personal interest and seek to jointly characterize these aspects in a disentangled manner. Consequently, we develop a novel Multi-granularity Graph Disentangled Learning framework named MGDL to effectively perform intelligent matching of fund investment products. Benefiting from the well-established fund graph and the attention module, multi-granularity user representations are derived from historical behaviors to separately express personal interest, conformity and risk preference in a fine-grained way. To attain stronger disentangled representations with specific semantics, MGDL explicitly involve two self-supervised signals, ie fund type based contrasts and fund popularity. Extensive experiments in offline and online environments verify the effectiveness of MGDL.
WSFE: Wasserstein Sub-graph Feature Encoder for Effective User Segmentation in Collaborative Filtering
Maximizing the user-item engagement based on vectorized embeddings is a standard procedure of recent recommender models. Despite the superior performance for item recommendations, these methods however implicitly deprioritize the modeling of user-wise similarity in the embedding space; consequently, identifying similar users is underperforming, and additional processing schemes are usually required otherwise. To avoid thorough model re-training, we propose WSFE, a model-agnostic and training-free representation encoder, to be flexibly employed on the fly for effective user segmentation. Underpinned by the optimal transport theory, the encoded representations from WSFE present a matched user-wise similarity/distance measurement between the realistic and embedding space. We incorporate WSFE into six state-of-the-art recommender models and conduct extensive experiments on six real-world datasets. The empirical analyses well demonstrate the superiority and generality of WSFE to fuel multiple downstream tasks with diverse underlying targets in recommendation.
Existing user simulators (USs) for task-oriented dialogue systems only model user behaviour on semantic and natural language levels without considering the user persona and emotions. Optimising dialogue systems with generic user policies, which cannot model diverse user behaviour driven by different emotional states, may result in a high drop-off rate when deployed in the real world. Thus, we present EmoUS, a user simulator that learns to simulate user emotions alongside user behaviour. EmoUS generates user emotions, semantic actions, and natural language responses based on the user goal, the dialogue history, and the user persona. By analysing what kind of system behaviour elicits what kind of user emotions, we show that EmoUS can be used as a probe to evaluate a variety of dialogue systems and in particular their effect on the user’s emotional state. Developing such methods is important in the age of large language model chat-bots and rising ethical concerns.
SESSION: Reproducibility Papers
In several recommendation scenarios, including next basket recommendation, the importance of repetition and exploration has been discovered and studied. Sequential recommenders (SR) aim to infer a user’s preferences and suggest the next item for them to interact with based on their historical interaction sequences. There has not been a systematic analysis of sequential recommenders from the perspective of repetition and exploration. As a result, it is unclear how these models, that are typically optimized for accuracy, perform in terms of repetition and exploration, as well as the potential drawbacks of deploying them in real applications.
In this paper, we examine whether repetition and exploration are important dimensions in the sequential recommendation scenario. We consider this generalizability question both from a user-centered and an item-centered perspective. Towards the latter, we define item repeat exposure and item explore exposure and examine the recommendation performance of sequential recommendation models in terms of both accuracy and exposure from the perspective of repetition and exploration. We find that (i) there is an imbalance in accuracy and difficulty w.r.t. repetition and exploration in SR scenarios, (ii) using the conventional average overall accuracy with a significance test does not fully represent a model’s recommendation accuracy, and (iii) accuracy-oriented sequential recommendation models may suffer from less/zero item explore exposure issue, where items are mostly (or even only) recommended to their repeat users and fail to reach their potential new users.
To analyze our findings, we remove repeat samples from the dataset, that often act as easy shortcuts, and focus on a pure exploration SR scenario. We find that (i) removing the repetition shortcut increases the recommendation novelty and helps users who prefer to consume novel items next, (ii) neural-based models fail to learn the basic characteristics of this pure exploration scenario and suffer from an inherent repetitive bias issue, (iii) using shared item embeddings in the prediction layer may skew recommendations to repeat items, and (iv) removing all repeat items to post-processing recommendation results leads to a substantial improvement on top of several SR methods.
Knowledge distillation plays a key role in boosting the effectiveness of rankers based on pre-trained language models (PLMs); this is achieved using an effective but inefficient large model to teach a more efficient student model. In the context of knowledge distillation for a student dense passage retriever, the balanced topic-aware sampling method has been shown to provide state-of-the-art effectiveness. This method intervenes in the creation of the training batches by creating batches that contain positive-negative pairs of passages from the same topic, and balancing the pairwise margins of the positive and negative passages.
In this paper, we reproduce the balanced topic-aware sampling method; we do so for both the dataset used for evaluation in the original work (MS MARCO) and for a dataset in a different domain, that of product search (Amazon shopping queries dataset) to study whether the original results generalize to a different context. We show that while we could not replicate the exact results from the original paper, we do confirm the original findings in terms of trends: balanced topic-aware sampling indeed leads to highly effective dense retrievers. These results partially generalize to the other search task we investigate, product search: although we observe the improvements are less significant compared to MS MARCO.
In addition to reproducing the original results and studying how the method generalizes to a different dataset, we also investigate a key aspect that influences the effectiveness of the method: the use of a hard margin threshold for negative sampling. This aspect was not studied in the original paper. With respect to hard margins, we find that while setting different hard margin values significantly influences the effectiveness of the student model, this impact is dataset-dependent — and indeed, it does depend on the score distributions exhibited by retrieval models on the dataset at hand. Our reproducibility code is available at https://github.com/ielab/TAS-B-Reproduction.
Reproducibility, Replicability, and Insights into Dense Multi-Representation Retrieval Models: from ColBERT to Col*
Dense multi-representation retrieval models, exemplified as ColBERT, estimate the relevance between a query and a document based on the similarity of their contextualised token-level embeddings. Indeed, by using contextualised token embeddings, dense retrieval, conducted as either exact or semantic matches, can result in increased effectiveness for both in-domain and out-of-domain retrieval tasks, indicating that it is an important model to study. However, the exact role that these semantic matches play is not yet well investigated. For instance, although tokenisation is one of the crucial design choices for various pretrained language models, its impact on the matching behaviour has not been examined in detail. In this work, we inspect the reproducibility and replicability of the contextualised late interaction mechanism by extending ColBERT to Col⋆ which implements the late interaction mechanism across various pretrained models and different types of tokenisers. As different tokenisation methods can directly impact the matching behaviour within the late interaction mechanism, we study the nature of matches occurring in different Col⋆ models, and further quantify the contribution of lexical and semantic matching on retrieval effectiveness. Overall, our experiments successfully reproduce the performance of ColBERT on various query sets, and replicate the late interaction mechanism upon different pretrained models with different tokenisers. Moreover, our experimental results yield new insights, such as: (i) semantic matching behaviour varies across different tokenisers; (ii) more specifically, high-frequency tokens tend to perform semantic matching than other token families; (iii) late interaction mechanism benefits more from lexical matching than semantic matching; (iv) special tokens, such as [CLS], play a very important role in late interaction.
Given a text query on a controversial topic, the task of Image Retrieval for Argumentation is to rank images according to how well they can be used to support a discussion on the topic. An important subtask therein is to determine the stance of the retrieved images, i.e., whether an image supports the pro or con side of the topic. In this paper, we conduct a comprehensive reproducibility study of the state of the art as represented by the CLEF’22 Touché lab and an in-house extension of it. Based on the submitted approaches, we developed a unified and modular retrieval process and reimplemented the submitted approaches according to this process. Through this unified reproduction (which also includes models not previously considered), we achieve an effectiveness improvement in argumentative image detection of up to 0.832 precision@10. However, despite this reproduction success, our study also revealed a previously unknown negative result: for stance detection, none of the reproduced or new approaches can convincingly beat a random baseline. To understand the apparent challenges inherent to image stance detection, we conduct a thorough error analysis and provide insight into potential new ways to approach this task.
Medical coding is the task of assigning medical codes to clinical free-text documentation. Healthcare professionals manually assign such codes to track patient diagnoses and treatments. Automated medical coding can considerably alleviate this administrative burden. In this paper, we reproduce, compare, and analyze state-of-the-art automated medical coding machine learning models. We show that several models underperform due to weak configurations, poorly sampled train-test splits, and insufficient evaluation. In previous work, the macro F1 score has been calculated sub-optimally, and our correction doubles it. We contribute a revised model comparison using stratified sampling and identical experimental setups, including hyperparameters and decision boundary tuning. We analyze prediction errors to validate and falsify assumptions of previous works. The analysis confirms that all models struggle with rare codes, while long documents only have a negligible impact. Finally, we present the first comprehensive results on the newly released MIMIC-IV dataset using the reproduced models. We release our code, model parameters, and new MIMIC-III and MIMIC-IV training and evaluation pipelines to accommodate fair future comparisons.
Query performance prediction (QPP) is a core task in information retrieval. The QPP task is to predict the retrieval quality of a search system for a query without relevance judgments. Research has shown the effectiveness and usefulness of QPP for ad-hoc search. Recent years have witnessed considerable progress in conversational search (CS). Effective QPP could help a CS system to decide an appropriate action to be taken at the next turn. Despite its potential, QPP for CS has been little studied. We address this research gap by reproducing and studying the effectiveness of existing QPP methods in the context of CS. While the task of passage retrieval remains the same in the two settings, a user query in CS depends on the conversational history, introducing novel QPP challenges. In particular, we seek to explore to what extent findings from QPP methods for ad-hoc search generalize to three CS settings: (i) estimating the retrieval quality of different query rewriting-based retrieval methods, (ii) estimating the retrieval quality of a conversational dense retrieval method, and (iii) estimating the retrieval quality for top ranks vs. deeper-ranked lists. Our findings can be summarized as follows: (i) supervised QPP methods distinctly outperform unsupervised counterparts only when a large-scale training set is available; (ii) point-wise supervised QPP methods outperform their list-wise counterparts in most cases; and (iii) retrieval score-based unsupervised QPP methods show high effectiveness in assessing the conversational dense retrieval method, ConvDR.
Main content extraction from web pages-sometimes also called boilerplate removal-has been a research topic for over two decades. Yet despite web pages being delivered in a machine-readable markup format, extracting the actual content is still a challenge today. Even with the latest HTML5 standard, which defines many semantic elements to mark content areas, web page authors do not always use semantic markup correctly or to its full potential, making it hard for automated systems to extract the relevant information. A high-precision, high-recall content extraction is crucial for downstream applications such as search engines, AI language tools, distraction-free reader modes in users’ browsers, and other general assistive technologies. For such a fundamental task, however, surprisingly few openly available extraction systems or training and benchmarking datasets exist. Even less research has gone into the rigorous evaluation and a true apples-to-apples comparison of the few extraction systems that do exist. To get a better grasp on the current state of the art in the field, we combine and clean eight existing human-labeled web content extraction datasets. On the combined dataset, we evaluate 14~competitive main content extraction systems and five baseline approaches. Finally, we build three ensembles as new state-of-the-art extraction baselines. We find that the performance of existing systems is quite genre-dependent and no single extractor performs best on all types of web pages.
SESSION: Perspective Papers
The rise in loosely-structured data available through text, images, and other modalities has called for new ways of querying them. Multimedia Information Retrieval has filled this gap and has witnessed exciting progress in recent years. Tasks such as search and retrieval of extensive multimedia archives have undergone massive performance improvements, driven to a large extent by recent developments in multimodal deep learning. However, methods in this field remain limited in the kinds of queries they support and, in particular, their inability to answer database-like queries. For this reason, inspired by recent work on neural databases, we propose a new framework, which we name Multimodal Neural Databases (MMNDBs). MMNDBs can answer complex database-like queries that involve reasoning over different input modalities, such as text and images, at scale. In this paper, we present the first architecture able to fulfill this set of requirements and test it with several baselines, showing the limitations of currently available models. The results show the potential of these new techniques to process unstructured data coming from different modalities, paving the way for future research in the area.
Recommendation has become a prominent area of research in the field of Information Retrieval (IR). Evaluation is also a traditional research topic in this community. Motivated by a few counter-intuitive observations reported in recent studies, this perspectives paper takes a fresh look at recommender systems from an evaluation standpoint. Rather than examining metrics like recall, hit rate, or NDCG, or perspectives like novelty and diversity, the key focus here is on how these metrics are calculated when evaluating a recommender algorithm. Specifically, the commonly used train/test data splits and their consequences are re-examined. We begin by examining common data splitting methods, such as random split or leave-one-out, and discuss why the popularity baseline is poorly defined under such splits. We then move on to explore the two implications of neglecting a global timeline during evaluation: data leakage and oversimplification of user preference modeling. Afterwards, we present new perspectives on recommender systems, including techniques for evaluating algorithm performance that more accurately reflect real-world scenarios, and possible approaches to consider decision contexts in user preference modeling.
Recommendation models that utilize unique identities (IDs for short) to represent distinct users and items have been state-of-the-art (SOTA) and dominated the recommender systems (RS) literature for over a decade. Meanwhile, the pre-trained modality encoders, such as BERT  and Vision Transformer , have become increasingly powerful in modeling the raw modality features of an item, such as text and images. Given this, a natural question arises: can a purely modality-based recommendation model (MoRec) outperforms or matches a pure ID-based model (IDRec) by replacing the itemID embedding with a SOTA modality encoder? In fact, this question was answered ten years ago when IDRec beats MoRec by a strong margin in both recommendation accuracy and efficiency.
We aim to revisit this ‘old’ question and systematically study MoRec from several aspects. Specifically, we study several sub-questions: (i) which recommendation paradigm, MoRec or IDRec, performs better in practical scenarios, especially in the general setting and warm item scenarios where IDRec has a strong advantage? does this hold for items with different modality features? (ii) can the latest technical advances from other communities (i.e., natural language processing and computer vision) translate into accuracy improvement for MoRec? (iii) how to effectively utilize item modality representation, can we use it directly or do we have to adjust it with new data? (iv) are there any key challenges that MoRec needs to address in practical applications? To answer them, we conduct rigorous experiments for item recommendations with two popular modalities, i.e., text and vision. We provide the first empirical evidence that MoRec is already comparable to its IDRec counterpart with an expensive end-to-end training method, even for warm item recommendation. Our results potentially imply that the dominance of IDRec in the RS field may be greatly challenged in the future. We release our code and other materials at https://github.com/westlake-repl/IDvs.MoRec.
Online platforms mediate access to opportunity: relevance-based rankings create and constrain options by allocating exposure to job openings and job candidates in hiring platforms, or sellers in a marketplace. In order to do so responsibly, these socially consequential systems employ various fairness measures and interventions, many of which seek to allocate exposure based on worthiness. Because these constructs are typically not directly observable, platforms must instead resort to using proxy scores such as relevance and infer them from behavioral signals such as searcher clicks. Yet, it remains an open question whether relevance fulfills its role as %a deservedness score such a worthiness score in high-stakes fair rankings.
In this paper, we combine perspectives and tools from the social sciences, information retrieval, and fairness in machine learning to derive a set of desired criteria that relevance scores should satisfy in order to meaningfully guide fairness interventions. We then empirically show that not all of these criteria are met in a case study of relevance inferred from biased user click data. We assess the impact of these violations on the estimated system fairness and analyze whether existing fairness interventions may mitigate the identified issues. Our analyses and results surface the pressing need for new approaches to relevance collection and generation that are suitable for use in fair ranking.
In real-world recommender model deployments, the models are typically retrained and deployed repeatedly. It is the rule-of-thumb to periodically retrain recommender models to capture up-to-date user behavior and item trends. However, the harm caused by delayed model updates has not been investigated extensively yet. in this perspective paper, we formulate the delayed model update problem and quantitatively demonstrate the delayed model update actually harms the model performance by increasing the number of cold users and cold items increase and decreasing overall model performances. These effects vary across different domains having different characteristics. Upon these findings, we further argue that although the delayed model update has negative effects on online recommender model deployment, yet it has not gathered enough attention from research communities. We argue our verification of the relationship between the model update cycle and model performance calls for further research such as faster model training, and more efficient data pipelines to keep the model more up-to-date with the latest user behaviors and item trends.
Ranking is at the core of Information Retrieval. Classic ranking optimization studies often treat ranking as a sorting problem with the assumption that the best performance of ranking would be achieved if we rank items according to their individual utility. Accordingly, considerable ranking metrics have been developed and learning-to-rank algorithms that have been designed to optimize these simple performance metrics have been widely used in modern IR systems. As applications evolve, however, people’s need for information retrieval have shifted from simply retrieving relevant documents to more advanced information services that satisfy their complex working and entertainment needs. Thus, more complicated and user-centric objectives such as user satisfaction and engagement have been adopted to evaluate modern IR systems today. Those objectives, unfortunately, are difficult to be optimized under existing learning-to-rank frameworks as they are subject to great variance and complicated structures that cannot be explicitly explained or formulated with math equations like those simple performance metrics. This leads to the following research question — how to optimize result ranking for complex ranking metrics without knowing their internal structures? To address this question, we conduct formal analysis on the limitation of existing ranking optimization techniques and describe three research tasks in Metric-agnostic Ranking Optimization: (1) develop surrogate metric models to simulate complex online ranking metrics on offline data; (2) develop differentiable ranking optimization frameworks for list or session level performance metrics without fine-grained supervision signals; and (3) develop efficient parameter exploration and exploitation techniques for ranking optimization in metric-agnostic scenarios. Through the discussion of potential solutions to these tasks, we hope to encourage more people to look into the problem of ranking optimization in complex search and recommendation scenarios.
SESSION: Resource Papers
Passage ranking involves two stages: passage retrieval and passage re-ranking, which are important and challenging topics for both academics and industries in the area of Information Retrieval (IR). However, the commonly-used datasets for passage ranking usually focus on the English language. For non-English scenarios, such as Chinese, the existing datasets are limited in terms of data scale, fine-grained relevance annotation and false negative issues. To address this problem, we introduce T2Ranking, a large-scale Chinese benchmark for passage ranking. T2Ranking comprises more than 300K queries and over 2M unique passages from real-world search engines. Expert annotators are recruited to provide 4-level graded relevance scores (fine-grained) for query-passage pairs instead of binary relevance judgments (coarse-grained). To ease the false negative issues, more passages with higher diversities are considered when performing relevance annotations, especially in the test set, to ensure a more accurate evaluation. Apart from the textual query and passage data, other auxiliary resources are also provided, such as query types and XML files of documents which passages are generated from, to facilitate further studies. To evaluate the dataset, commonly used ranking models are implemented and tested on T2Ranking as baselines. The experimental results show that T2Ranking is challenging and there is still scope for improvement. The full data and all codes are available at https://github.com/THUIR/T2Ranking/.
BizGraphQA: A Dataset for Image-based Inference over Graph-structured Diagrams from Business Domains
Graph-structured diagrams, such as enterprise ownership charts or management hierarchies, are a challenging medium for deep learning models as they not only require the capacity to model language and spatial relations but also the topology of links between entities and the varying semantics of what those links represent. Devising Question Answering models that automatically process and understand such diagrams have vast applications to many enterprise domains, and can move the state-of-the-art on multimodal document understanding to a new frontier. Curating real-world datasets to train these models can be difficult, due to scarcity and confidentiality of the documents where such diagrams are included. Recently released synthetic datasets are often prone to repetitive structures that can be memorized or tackled using heuristics. In this paper, we present a collection of 10,000 synthetic graphs that faithfully reflect properties of real graphs in four business domains, and are realistically rendered within a PDF document with varying styles and layouts. In addition, we have generated over 130,000 question instances that target complex graphical relationships specific to each domain. We hope this challenge will encourage the development of models capable of robust reasoning about graph structured images, which are ubiquitous in numerous sectors in business and across scientific disciplines.
Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions and Prospects
Conversational recommender systems (CRSs) have become crucial emerging research topics in the field of RSs, thanks to their natural advantages of explicitly acquiring user preferences via interactive conversations and revealing the reasons behind recommendations. However, the majority of current CRSs are text-based, which is less user-friendly and may pose challenges for certain users, such as those with visual impairments or limited writing and reading abilities. Therefore,for the first time, this paper investigates the potential of voice-based CRS (VCRSs) to revolutionize the way users interact with RSs in a natural, intuitive, convenient, and accessible fashion. To support such studies, we create two VCRSs benchmark datasets in the e-commerce and movie domains, after realizing the lack of such datasets through an exhaustive literature review. Specifically, we first empirically verify the benefits and necessity of creating such datasets. Thereafter, we convert the user-item interactions to text-based conversations through the ChatGPT-driven prompts for generating diverse and natural templates, and then synthesize the corresponding audios via the text-to-speech model. Meanwhile, a number of strategies are delicately designed to ensure the naturalness and high quality of voice conversations. On this basis, we further explore the potential solutions and point out possible directions to build end-to-end VCRSs by seamlessly extracting and integrating voice-based inputs, thus delivering performance-enhanced, self-explainable, and user-friendly VCRSs. Our study aims to establish the foundation and motivate further pioneering research in the emerging field of VCRSs. This aligns with the principles of explainable AI and AI for social good, viz., utilizing technology’s potential to create a fair, sustainable, and just world. Our codes and datasets are available on GitHub (https://github.com/hyllll/VCRS ).
Content Warning: this paper may contain content that is offensive or upsetting.
Dialogue systems have been widely applied in many scenarios and are now more powerful and ubiquitous than ever before. With large neural models and massive available data, current dialogue systems have access to more knowledge than any people in their life. However, current dialogue systems still do not perform at a human level. One major gap between conversational agents and humans lies in their abilities to be aware of social norms. The development of socially-aware dialogue systems is impeded due to the lack of resources. In this paper, we present the first socially-aware dialogue corpus — SocialDial based on Chinese social culture. SocialDial consists of two parts: 1,563 multi-turn dialogues between two human speakers with fine-grained labels, and 4,870 synthetic conversations generated by ChatGPT. The human corpus covers five categories of social norms, which have 14 sub-categories in total. Specifically, it contains social factor annotations including social relation, context, social distance, and social norms. However, collecting sufficient socially-aware dialogues is costly. Thus, we harness the power of ChatGPT and devise an ontology-based synthetic data generation framework. This framework is able to generate synthetic data at scale. To ensure the quality of synthetic dialogues, we design several mechanisms for quality control during data collection. Finally, we evaluate our dataset using several pre-trained models, such as BERT and RoBERTa. Comprehensive empirical results based on state-of-the-art neural models demonstrate that modeling of social norms for dialogue systems is a promising research direction. To the best of our knowledge, SocialDial is the first socially-aware dialogue dataset that covers multiple social factors and has fine-grained labels.
Conversational recommender systems ( CRS s) aim to understand the information needs and preferences expressed in a dialogue to recommend suitable items to the user. Most of the existing conversational recommendation datasets are synthesized or simulated with crowdsourcing, which has a large gap with real-world scenarios. To bridge the gap, previous work contributes a dataset E-ConvRec, based on pre-sales dialogues between users and customer service staff in E-commerce scenarios. However, E-ConvRec only supplies coarse-grained annotations and general tasks for making recommendations in pre-sales dialogues. Different from it, we use real user needs as a clue to explore the E-commerce conversational recommendation in complex pre-sales dialogues, namely user needs-centric E-commerce conversational recommendation (UNECR).
In this paper, we construct a user needs-centric E-commerce conversational recommendation dataset (U-NEED ) from real-world E-commerce scenarios. U-NEED consists of 3 types of resources: (i) 7,698 fine-grained annotated pre-sales dialogues in 5 top categories (ii) 333,879 user behaviors and (iii) 332,148 product knowledge tuples. To facilitate the research of UNECR, we propose 5 critical tasks: (i) pre-sales dialogue understanding (ii) user needs elicitation (iii) user needs-based recommendation (iv) pre-sales dialogue generation and (v) pre-sales dialogue evaluation. We establish baseline methods and evaluation metrics for each task. We report experimental results of 5 tasks on U-NEED . We also report results on 3 typical categories. Experimental results indicate that the challenges of UNECR in various categories are different.
We propose end-to-end multimodal fact-checking and explanation generation, where the input is a claim and a large collection of web sources, including articles, images, videos, and tweets, and the goal is to assess the truthfulness of the claim by retrieving relevant evidence and predicting a truthfulness label (e.g., support, refute or not enough information), and to generate a statement to summarize and explain the reasoning and ruling process. To support this research, we construct MOCHEG, a large-scale dataset consisting of 15,601 claims where each claim is annotated with a truthfulness label and a ruling statement, and 33,880 textual paragraphs and 12,112 images in total as evidence. To establish baseline performances on MOCHEG, we experiment with several state-of-the-art neural architectures on the three pipelined subtasks: multimodal evidence retrieval, claim verification, and explanation generation, and demonstrate that the performance of the state-of-the-art end-to-end multimodal fact-checking does not provide satisfactory outcomes. To the best of our knowledge, we are the first to build the benchmark dataset and solutions for end-to-end multimodal fact-checking and explanation generation. The dataset, source code and model checkpoints are available at https://github.com/VT-NLP/Mocheg.
Recipe-MPR: A Test Collection for Evaluating Multi-aspect Preference-based Natural Language Retrieval
The rise of interactive recommendation assistants has led to a novel domain of natural language (NL) recommendation that would benefit from improved multi-aspect reasoning to retrieve relevant items based on NL statements of preference. Such preference statements often involve multiple aspects, e.g., “I would like meat lasagna but I’m watching my weight”. Unfortunately, progress in this domain is slowed by the lack of annotated data. To address this gap, we curate a novel dataset which captures logical reasoning over multi-aspect, NL preference-based queries and a set of multiple-choice, multi-aspect item descriptions. We focus on the recipe domain in which multi-aspect preferences are often encountered due to the complexity of the human diet. The goal of publishing our dataset is to provide a benchmark for joint progress in three key areas: 1) structured, multi-aspect NL reasoning with a variety of properties (e.g., level of specificity, presence of negation, and the need for commonsense, analogical, and/or temporal inference), 2) the ability of recommender systems to respond to NL preference utterances, and 3) explainable NL recommendation facilitated by aspect extraction and reasoning. We perform experiments using a variety of methods (sparse and dense retrieval, zero- and few-shot reasoning with large language models) in two settings: a monolithic setting which uses the full query and an aspect-based setting which isolates individual query aspects and aggregates the results. GPT-3 results in much stronger performance than other methods with 73% zero-shot accuracy and 83% few-shot accuracy in the monolithic setting. Aspect-based GPT-3, which facilitates structured explanations, also shows promise with 68% zero-shot accuracy. These results establish baselines for future research into explainable recommendations via multi-aspect preference-based NL reasoning.
Beyond Single Items: Exploring User Preferences in Item Sets with the Conversational Playlist Curation Dataset
Users in consumption domains, like music, are often able to more efficiently provide preferences over a set of items (e.g. a playlist or radio) than over single items (e.g. songs). Unfortunately, this is an underexplored area of research, with most existing recommendation systems limited to understanding preferences over single items. Curating an item set exponentiates the search space that recommender systems must consider (all subsets of items!): this motivates conversational approaches-where users explicitly state or refine their preferences and systems elicit preferences in natural language-as an efficient way to understand user needs. We call this task conversational item set curation and present a novel data collection methodology that efficiently collects realistic preferences about item sets in a conversational setting by observing both item-level and set-level feedback. We apply this methodology to music recommendation to build the Conversational Playlist Curation Dataset (CPCD), where we show that it leads raters to express preferences that would not be otherwise expressed. Finally, we propose a wide range of conversational retrieval models as baselines for this task and evaluate them on the dataset.
Although media bias detection is a complex multi-task problem, there is, to date, no unified benchmark grouping these evaluation tasks. We introduce the Media Bias Identification Benchmark (MBIB), a comprehensive benchmark that groups different types of media bias (e.g., linguistic, cognitive, political) under a common framework to test how prospective detection techniques generalize. After reviewing 115 datasets, we select nine tasks and carefully propose 22 associated datasets for evaluating media bias detection techniques. We evaluate MBIB using state-of-the-art Transformer techniques (e.g., T5, BART). Our results suggest that while hate speech, racial bias, and gender bias are easier to detect, models struggle to handle certain bias types, e.g., cognitive and political bias. However, our results show that no single technique can outperform all the others significantly.We also find an uneven distribution of research interest and resource allocation to the individual tasks in media bias. A unified benchmark encourages the development of more robust systems and shifts the current paradigm in media bias detection evaluation towards solutions that tackle not one but multiple media bias types simultaneously.
Conversational systems can be particularly effective in supporting complex information seeking scenarios with evolving information needs. Finding the right products on an e-commerce platform is one such scenario, where a conversational agent would need to be able to provide search capabilities over the item catalog, understand and make recommendations based on the user’s preferences, and answer a range of questions related to items and their usage. Yet, existing conversational datasets do not fully support the idea of mixing different conversational goals (i.e., search, recommendation, and question answering) and instead focus on a single goal. To address this, we introduce MG-ShopDial: a dataset of conversations mixing different goals in the domain of e-commerce. Specifically, we make the following contributions. First, we develop a coached human-human data collection protocol where each dialogue participant is given a set of instructions, instead of a specific script or answers to choose from. Second, we implement a data collection tool to facilitate the collection of multi-goal conversations via a web chat interface, using the above protocol. Third, we create the MG-ShopDial collection, which contains 64 high-quality dialogues with a total of 2,196 utterances for e-commerce scenarios of varying complexity. The dataset is additionally annotated with both intents and goals on the utterance level. Finally, we present an analysis of this dataset and identify multi-goal conversational patterns.
Explanations in conventional recommender systems have demonstrated benefits in helping the user understand the rationality of the recommendations and improving the system’s efficiency, transparency, and trustworthiness. In the conversational environment, multiple contextualized explanations need to be generated, which poses further challenges for explanations. To better measure explainability in CRS, we propose ten evaluation perspectives based on the concepts from conventional recommender systems together with the characteristics of CRS. We assess five existing CRS benchmark datasets using these metrics and observe the necessity of improving the explanation quality of CRS. To achieve this, we conduct manual and automatic approaches to extend these dialogues and construct a new CRS dataset, namely Explainable Recommendation Dialogues (E-ReDial). It includes 756 dialogues with over 2,000 high-quality rewritten explanations. We compare two baseline approaches to perform explanation generation based on E-ReDial. Experimental results suggest that models trained on E-ReDial can significantly improve explainability while introducing knowledge into the models can further improve the performance. GPT-3 in the in-context learning setting can generate more realistic and diverse movie descriptions. In contrast, T5 training on E-Redial can better generate clear reasons for recommendations based on user preferences. E-ReDial is available at https://github.com/Superbooming/E-ReDial.
Despite recent advances in information retrieval and natural language processing, rhetorical devices that exploit ambiguity or subvert linguistic rules remain a challenge for such systems. However, corpus-based analysis of wordplay has been a perennial topic of scholarship in the humanities, including literary criticism, language education, and translation studies. The immense data-gathering effort required for these studies points to the need for specialized text retrieval and classification technology, and consequently for appropriate test collections. In this paper, we introduce and analyze a new dataset for research and applications in the retrieval and processing of wordplay. Developed for the JOKER track at CLEF 2023, our annotated corpus extends and improves upon past English wordplay detection datasets in several ways. First, we introduce hundreds of additional positive examples of wordplay; second, we provide French translations for the examples; and third, we provide negative examples of non-wordplay with characteristics closely matching those of the positive examples. This last feature helps ensure that AI models learn to effectively distinguish wordplay from non-wordplay, and not simply texts differing in length, style, or vocabulary. Our test collection represents then a step towards wordplay-aware multilingual information retrieval.
Compared to general document analysis tasks, form document structure understanding and retrieval are challenging. Form documents are typically made by two types of authors; A form designer, who develops the form structure and keys, and a form user, who fills out form values based on the provided keys. Hence, the form values may not be aligned with the form designer’s intention (structure and keys) if a form user gets confused. In this paper, we introduce Form-NLU, the first novel dataset for form structure understanding and its key and value information extraction, interpreting the form designer’s intent and the alignment of user-written value on it. It consists of 857 form images, 6k form keys and values, and 4k table keys and values. Our dataset also includes three form types: digital, printed, and handwritten, which cover diverse form appearances and layouts. We propose a robust positional and logical relation-based form key-value information extraction framework. Using this dataset, Form-NLU, we first examine strong object detection models for the form layout understanding, then evaluate the key information extraction task on the dataset, providing fine-grained results for different types of forms and keys. Furthermore, we examine it with the off-the-shelf pdf layout extraction tool and prove its feasibility in real-world cases.
MMEAD, or MS MARCO Entity Annotations and Disambiguations, is a resource for entity links for the MS MARCO datasets. We specify a format to store and share links for both document and passage collections of MS MARCO. Following this specification, we release entity links to Wikipedia for documents and passages in both MS MARCO collections (v1 and v2). Entity links have been produced by the REL and BLINK systems. MMEAD is an easy-to-install Python package, allowing users to load the link data and entity embeddings effortlessly. Using MMEAD takes only a few lines of code. Finally, we show how MMEAD can be used for IR research that uses entity information. We show how to improve recall@1000 and MRR@10 on more complex queries on the MS MARCO v1 passage dataset by using this resource. We also demonstrate how entity expansions can be used for interactive search applications.
We integrate irdatasets, ir_measures, and PyTerrier with TIRA in the Information Retrieval Experiment Platform (TIREx) to promote more standardized, reproducible, scalable, and even blinded retrieval experiments. Standardization is achieved when a retrieval approach implements PyTerrier’s interfaces and the input and output of an experiment are compatible with ir_datasets and ir_measures. However, none of this is a must for reproducibility and scalability, as TIRA can run any dockerized software locally or remotely in a cloud-native execution environment. Version control and caching ensure efficient (re)execution. TIRA allows for blind evaluation when an experiment runs on a remote server or cloud not under the control of the experimenter. The test data and ground truth are then hidden from public access, and the retrieval software has to process them in a sandbox that prevents data leaks.
We currently host an instance of TIREx with 15 corpora (1.9~billion documents) on which 32 shared retrieval tasks are based. Using Docker images of 50~standard retrieval approaches, we automatically evaluated all approaches on all tasks (50 ⋅ 32 = 1,600 runs) in less than a week on a midsize cluster (1,620 cores and 24 GPUs). This instance of TIREx is open for submissions and will be integrated with the IR Anthology, as well as released open source.
In recent years, the reproducibility of recommendation models has become a severe concern in recommender systems. In light of this challenge, we have previously released a unified, comprehensive and efficient recommendation library called RecBole, attracting much attention from the research community. With the increasing number of users, we have received a number of suggestions and update requests. This motivates us to make further improvements on our library, so as to meet the user requirements and contribute to the research community. In this paper, we present a significant update of RecBole, making it more user-friendly and easy-to-use as a comprehensive benchmark library for recommendation. More specifically, the highlights of this update are summarized as: (1) we include more benchmark models and datasets, improve the benchmark framework in terms of data processing, training and evaluation, and release reproducible configurations to benchmark the recommendation models; (2) we upgrade the user friendliness of our library by providing more detailed documentation and well-organized frequently asked questions, and (3) we propose several development guidelines for the open-source library developers. These extensions make it much easier to reproduce the benchmark results and stay up-to-date with the recent advances on recommender systems. Our update is released at the link: https://github.com/RUCAIBox/RecBole.
The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives
The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 137 million search result pages, and 1.4 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.
Graph Neural Networks (GNNs) have shown their superiority in modeling graph-structured data, and gained much attention over the last five years. Though traditional deep learning frameworks such as TensorFlow and PyTorch provide convenient tools for implementing neural network algorithms, they do not support the key operations of GNNs well, e.g., the message passing computation based on sparse matrices. To address this issue, GNN libraries such as PyG are proposed by introducing rich Application Programming Interfaces (APIs) specialized for GNNs. However, most current GNN libraries only support a specific deep learning framework as the backend, e.g., PyG is tied up with PyTorch. In practice, users usually need to combine GNNs with other neural network components, which may come from their co-workers or open-source codes with different deep-learning backends. Consequently, users have to be familiar with various GNN libraries, and rewrite their GNNs with corresponding APIs. To provide a more convenient user experience, we present Gamma Graph Library (GammaGL), a GNN library that supports multiple deep learning frameworks as backends. GammaGL uses a framework-agnostic design that allows users to easily switch between deep learning backends on top of existing components with a single line of code change. Following the tensor-centric design idea, GammaGL splits the graph data into several key tensors, and abstracts GNN computational processes (such as message passing and graph mini-batch operations) into a few key functions. We develop many efficient operators in GammaGL for acceleration. So far, GammaGL has provided more than 40 GNN examples that can be applied to a variety of downstream tasks. GammaGL also provides tools for heterogeneous graph neural networks and recommendations to facilitate research in related fields. We present the performance of models implemented by GammaGL and the time consumption of our optimized operators to show the efficiency. Our library is available at https://github.com/BUPT-GAMMA/GammaGL.
Temporal information extraction (TIE) has attracted a great deal of interest over the last two decades. Such endeavors have led to the development of a significant number of datasets. Despite its benefits, having access to a large volume of corpora makes it difficult to benchmark TIE systems. On the one hand, different datasets have different annotation schemes, which hinders the comparison between competitors across different corpora. On the other hand, the fact that each corpus is disseminated in a different format requires a considerable engineering effort for a researcher/practitioner to develop parsers for all of them. These constraints force researchers to select a limited amount of datasets to evaluate their systems which consequently limits the comparability of the systems. Yet another obstacle to the comparability of TIE systems is the evaluation metric employed. While most research works adopt traditional metrics such as precision, recall, and F1, a few others prefer temporal awareness — a metric tailored to be more comprehensive on the evaluation of temporal systems. Although the reason for the absence of temporal awareness in the evaluation of most systems is not clear, one of the factors that certainly weighs on this decision is the need to implement the temporal closure algorithm, which is neither straightforward to implement nor easily available. All in all, these problems have limited the fair comparison between approaches and consequently, the development of TIE systems. To mitigate these problems, we have developed tieval, a Python library that provides a concise interface for importing different corpora and is equipped with domain-specific operations that facilitate system evaluation. In this paper, we present the first public release of tieval and highlight its most relevant features. The library is available as open source, under MIT License, at PyPI and GitHub.
While there are many test collections for Cross-Language Information Retrieval (CLIR), none of the large public test collections focus on short informal text documents. This paper introduces a new pair of CLIR test collections with millions of Chinese or Persian Tweets or Tweet threads as documents, sixty event-motivated topics written both in English and in each of the two document languages, and three-point graded relevance judgments constructed using interactive search and active learning. The design and construction of these new test collections are described, and baseline results are presented that demonstrate the utility of the collections for system evaluation. Shallow pooling is used to assess the efficacy of active learning to select documents for judgment.
A dozen recommendation libraries have recently been developed to accommodate popular recommendation algorithms for reproducibility. However, they are almost simply a collection of algorithms, overlooking the modularization of recommendation algorithms and their usage in practical scenarios. Algorithmic modularization has the following advantages: 1) helps to understand the effectiveness of each algorithm; 2) easily assembles new algorithms with well-performed modules by either drag-and-drop programming or automatic machine learning; 3) enables reinforcement between algorithms since one algorithm may act as a module of another algorithm. To this end, we develop a highly-modularized recommender system — RecStudio, in which any recommendation algorithm is categorized into either a ranker or a retriever. In the RecStudio library, we implement 90 recommendation algorithms with the pure Pytorch, covering both common algorithms in other libraries and complex algorithms involving multiple recommendation models. RecStudio is featured from several perspectives, such as index-supported efficient recommendation and evaluation, GPU-accelerated negative sampling, hyperparameter learning on the validation, and cooperation between the retriever and ranker. RecStudio is also equipped with a web service, where the recommendation pipeline can be quickly established and visually evaluated on selected datasets, and the evaluation results are automatically archived and visualized in a leaderboard. The project and documents are released at http://recstudio.org.cn.
As social media platforms are evolving from text-based forums into multi-modal environments, the nature of misinformation in social media is also transforming accordingly. Misinformation spreaders have recently targeted contextual connections between the modalities e.g., text and image. However, existing datasets for rumor detection mainly focus on a single modality i.e., text. To bridge this gap, we construct MR2, a multimodal multilingual retrieval-augmented dataset for rumor detection. The dataset covers rumors with images and texts, and provides evidence from both modalities that are retrieved from the Internet. Further, we develop established baselines and conduct a detailed analysis of the systems evaluated on the dataset. Extensive experiments show that MR2 will provide a challenging testbed for developing rumor detection systems designed to retrieve and reason over social media posts. Source code and data are available at: https://github.com/THU-BPM/MR2.
BioSift: A Dataset for Filtering Biomedical Abstracts for Drug Repurposing and Clinical Meta-Analysis
This work presents a new, original document classification dataset, BioSift, to expedite the initial selection and labeling of studies for drug repurposing. The dataset consists of 10,000 human-annotated abstracts from scientific articles in PubMed. Each abstract is labeled with up to eight attributes necessary to perform meta-analysis utilizing the popular patient-intervention-comparator-outcome (PICO) method: has human subjects, is clinical trial/cohort, has population size, has target disease, has study drug, has comparator group, has a quantitative outcome, and an “aggregate” label. Each abstract was annotated by 3 different annotators (i.e., biomedical students) and randomly sampled abstracts were reviewed by senior annotators to ensure quality. Data statistics such as reviewer agreement, label co-occurrence, and confidence are shown. Robust benchmark results illustrate neither PubMed advanced filters nor state-of-the-art document classification schemes (e.g., active learning, weak supervision, full supervision) can efficiently replace human annotation. In short, BioSift is a pivotal but challenging document classification task to expedite drug repurposing. The full annotated dataset is publicly available and enables research development of algorithms for document classification that enhance drug repurposing.
MoocRadar: A Fine-grained and Multi-aspect Knowledge Repository for Improving Cognitive Student Modeling in MOOCs
Student modeling, the task of inferring a student’s learning characteristics through their interactions with coursework, is a fundamental issue in intelligent education. Although the recent attempts from knowledge tracing and cognitive diagnosis propose several promising directions for improving the usability and effectiveness of current models, the existing public datasets are still insufficient to meet the need for these potential solutions due to their ignorance of complete exercising contexts, fine-grained concepts, and cognitive labels. In this paper, we present MoocRadar, a fine-grained, multi-aspect knowledge repository consisting of 2,513 exercise questions, 5,600 knowledge concepts, and over 12 million behavioral records. Specifically, we propose a framework to guarantee a high-quality and comprehensive annotation of fine-grained concepts and cognitive labels. The statistical and experimental results indicate that our dataset provides the basis for the future improvements of existing methods. Moreover, to support the convenient usage for researchers, we release a set of tools for data querying, model adaption, and even the extension of our repository, which are now available at https://github.com/THU-KEG/MOOC-Radar.
Reinforcement learning based recommender systems (RL-based RS) aim at learning a good policy from a batch of collected data, by casting recommendations to multi-step decision-making tasks. However, current RL-based RS research commonly has a large reality gap. In this paper, we introduce the first open-source real-world dataset, RL4RS, hoping to replace the artificial datasets and semi-simulated RS datasets previous studies used due to the resource limitation of the RL-based RS domain. Unlike academic RL research, RL-based RS suffers from the difficulties of being well-validated before deployment. We attempt to propose a new systematic evaluation framework, including evaluation of environment simulation, evaluation on environments, and counterfactual policy evaluation. In summary, the RL4RS (Reinforcement Learning for Recommender Systems), a new resource with special concerns on the reality gaps, contains two real-world datasets, data understanding tools, tuned simulation environments, related advanced RL baselines, batch RL baselines, and counterfactual policy evaluation algorithms. The RL4RS suite can be found at https://github.com/fuxiAIlab/RL4RS.
Recently, personalized product search attracts great attention and many models have been proposed. To evaluate the effectiveness of these models, previous studies mainly utilize the simulated Amazon recommendation dataset, which contains automatically generated queries and excludes cold users and tail products. We argue that evaluating with such a dataset may yield unreliable results and conclusions, and deviate from real user satisfaction. To overcome these problems, in this paper, we release a personalized product search dataset comprised of real user queries and diverse user-product interaction types (clicking, adding to cart, following, and purchasing) collected from JD.com, a popular Chinese online shopping platform. More specifically, we sample about 170,000 active users on a specific date, then record all their interacted products and issued queries in one year, without removing any tail users and products. This finally results in roughly 12,000,000 products, 9,400,000 real searches, and 26,000,000 user-product interactions. We study the characteristics of this dataset from various perspectives and evaluate representative personalization models to verify its feasibility. The dataset can be publicly accessed at Github: https://github.com/rucliujn/JDsearch.
To date, query performance prediction (QPP) in the context of content-based image retrieval remains a largely unexplored task, especially in the query-by-example scenario, where the query is an image. To boost the exploration of the QPP task in image retrieval, we propose the first benchmark for image query performance prediction (iQPP). First, we establish a set of four data sets (PASCAL VOC 2012, Caltech-101, ROxford5k and RParis6k) and estimate the ground-truth difficulty of each query as the average precision or the precision@k, using two state-of-the-art image retrieval models. Next, we propose and evaluate novel pre-retrieval and post-retrieval query performance predictors, comparing them with existing or adapted (from text to image) predictors. The empirical results show that most predictors do not generalize across evaluation scenarios. Our comprehensive experiments indicate that iQPP is a challenging benchmark, revealing an important research gap that needs to be addressed in future work. We release our code and data as open source at https://github.com/Eduard6421/iQPP, to foster future research.
Traditionally, sparse retrieval systems relied on lexical representations to retrieve documents, such as BM25, dominated information retrieval tasks. With the onset of pre-trained transformer models such as BERT, neural sparse retrieval has led to a new paradigm within retrieval. Despite the success, there has been limited software supporting different sparse retrievers running in a unified, common environment. This hinders practitioners from fairly comparing different sparse models and obtaining realistic evaluation results. Another missing piece is, that a majority of prior work evaluates sparse retrieval models on in-domain retrieval, i.e. on a single dataset: MS MARCO. However, a key requirement in practical retrieval systems requires models that can generalize well to unseen out-of-domain, i.e. zero-shot retrieval tasks. In this work, we provide SPRINT, a unified python toolkit based on Pyserini and Lucene, supporting a common interface for evaluating neural sparse retrieval. The toolkit currently includes five built-in models: uniCOIL, DeepImpact, SPARTA, TILDEv2 and SPLADEv2. Users can also easily add customized models by defining their term weighting method. Using our toolkit, we establish strong and reproducible zero-shot sparse retrieval baselines across the well-acknowledged benchmark, BEIR. Our results demonstrate that SPLADEv2 achieves the best average score of 0.470 nDCG@10 on BEIR amongst all neural sparse retrievers. In this work, we further uncover the reasons behind its performance gain. We show that SPLADEv2 produces sparse representations with a majority of tokens outside of the original query and document which is often crucial for its performance gains, i.e. a limitation among its other sparse counterparts. We provide our SPRINT toolkit, models, and data used in our experiments publicly here: https://github.com/thakur-nandan/sprint.
This paper presents the AToMiC (Authoring Tools for Multi media Content) dataset, designed to advance research in image/text cross-modal retrieval. While vision–language pretrained transformers have led to significant improvements in retrieval effectiveness, existing research has relied on image-caption datasets that feature only simplistic image–text relationships and underspecified user models of retrieval tasks. To address the gap between these oversimplified settings and real-world applications for multimedia content creation, we introduce a new approach for building retrieval test collections. We leverage hierarchical structures and diverse domains of texts, styles, and types of images, as well as large-scale image–document associations embedded in Wikipedia. We formulate two tasks based on a realistic user model and validate our dataset through retrieval experiments using baseline models. AToMiC offers a testbed for scalable, diverse, and reproducible multimedia retrieval research. Finally, our dataset provides the basis for a dedicated track at the 2023 Text Retrieval Conference (TREC), and is publicly available at https://github.com/TREC-AToMiC/AToMiC.
Extracting events from news stories as the aim of several Natural Language Processing (NLP) applications (e.g., question answering, news recommendation, news summarization) is not a trivial task, due to the complexity of natural language and the fact that news reporting is characterized by journalistic style and norms. Those aspects entail scattering an event description over several sentences within one document (or more documents), applying a mechanism of gradual specification of event-related information. This implies a widespread use of co-reference relations among the textual elements, conveying non-linear temporal information. In addition to this, despite the achievement of state-of-the-art results in several tasks, high-quality training datasets for non-English languages are rarely available.
This paper presents our preliminary study to develop an annotated Dataset for Italian Crime Event news (DICE). The contribution of the paper are: (1) the creation of a corpus of 10,395 crime news; (2) the annotation schema; (3) a dataset of 10,395 news with automatic annotations; (4) a preliminary manual annotation using the proposed schema of 1000 documents. The first tests on DICE have compared the performance of a manual annotator with that of single-span and multi-span question answering models and shown there is still a gap in the models, especially when dealing with more complex annotation tasks and limited training data. This underscores the importance of investing in the creation of high-quality annotated datasets like DICE, which can provide a solid foundation for training and testing a wide range of NLP models.
People tend to consider social platforms as convenient media for expressing their concerns and emotional struggles. With their widespread use, researchers could access and analyze user-generated content related to mental states. Computational models that exploit that data show promising results in detecting at-risk users based on engineered features or deep learning models. However, recent works revealed that these approaches have a limited capacity for generalization and interpretation when considering clinical settings. Grounding the models’ decisions on clinical and recognized symptoms can help to overcome these limitations. In this paper, we introduce BDI-Sen, a symptom-annotated sentence dataset for depressive disorder. BDI-Sen covers all the symptoms present in the Beck Depression Inventory-II (BDI-II), a reliable questionnaire used for detecting and measuring depression. The annotations in the collection reflect whether a statement about the specific symptom is informative (i.e., exposes traces about the individual’s state regarding that symptom). We thoroughly analyze this resource and explore linguistic style, emotional attribution, and other psycholinguistic markers. Additionally, we conduct a series of experiments investigating the utility of BDI-Sen for various tasks, including the detection and severity classification of symptoms. We also examine their generalization when considering symptoms from other mental diseases. BDI-Sen may aid the development of future models that consider trustworthy and valuable depression markers.
Recommender systems have become ubiquitous in our digital lives, from recommending products on e-commerce websites to suggesting movies and music on streaming platforms. Existing recommendation datasets, such as Amazon Product Reviews and MovieLens, greatly facilitated the research and development of recommender systems in their respective domains. While the number of mobile users and applications (aka apps) has increased exponentially over the past decade, research in mobile app recommender systems has been significantly constrained, primarily due to the lack of high-quality benchmark datasets, as opposed to recommendations for products, movies, and news. To facilitate research for app recommendation systems, we introduce a large-scale dataset, called MobileRec. We constructed MobileRec from users’ activity on the Google play store. MobileRec contains 19.3 million user interactions (i.e., user reviews on apps) with over 10K unique apps across 48 categories. MobileRec records the sequential activity of a total of 0.7 million distinct users. Each of these users has interacted with no fewer than five distinct apps, which stands in contrast to previous datasets on mobile apps that recorded only a single interaction per user. Furthermore, MobileRec presents users’ ratings as well as sentiments on installed apps, and each app contains rich metadata such as app name, category, description, and overall rating, among others. We demonstrate that MobileRec can serve as an excellent testbed for app recommendation through a comparative study of several state-of-the-art recommendation approaches. The MobileRec dataset is available at https://huggingface.co/datasets/recmeapp/mobilerec.
MythQA: Query-Based Large-Scale Check-Worthy Claim Detection through Multi-Answer Open-Domain Question Answering
Check-worthy claim detection aims at providing plausible misinformation to the downstream fact-checking systems or human experts to check. This is a crucial step toward accelerating the fact-checking process. Many efforts have been put into how to identify check-worthy claims from a small scale of pre-collected claims, but how to efficiently detect check-worthy claims directly from a large-scale information source, such as Twitter, remains underexplored. To fill this gap, we introduce MythQA, a new multi-answer open-domain question answering(QA) task that involves contradictory stance mining for query-based large-scale check-worthy claim detection. The idea behind this is that contradictory claims are a strong indicator of misinformation that merits scrutiny by the appropriate authorities. To study this task, we construct TweetMythQA, an evaluation dataset containing 522 factoid multi-answer questions based on controversial topics. Each question is annotated with multiple answers. Moreover, we collect relevant tweets for each distinct answer, then classify them into three categories: “Supporting”, “Refuting”, and “Neutral”. In total, we annotated 5.3K tweets. Contradictory evidence is collected for all answers in the dataset. Finally, we present a baseline system for MythQA and evaluate existing NLP models for each system component using the TweetMythQA dataset. We provide initial benchmarks and identify key challenges for future models to improve upon. Code and data are available at: https://github.com/TonyBY/Myth-QA
RiverText: A Python Library for Training and Evaluating Incremental Word Embeddings from Text Data Streams
Word embeddings have become essential components in various information retrieval and natural language processing tasks, such as ranking, document classification, and question answering. However, despite their widespread use, traditional word embedding models present a limitation in their static nature, which hampers their ability to adapt to the constantly evolving language patterns that emerge in sources such as social media and the web (e.g., new hashtags or brand names). To overcome this problem, incremental word embedding algorithms are introduced, capable of dynamically updating word representations in response to new language patterns and processing continuous data streams.
This paper presents RiverText, a Python library for training and evaluating incremental word embeddings from text data streams. Our tool is a resource for the information retrieval and natural language processing communities that work with word embeddings in streaming scenarios, such as analyzing social media. The library implements different incremental word embedding techniques, such as Skip-gram, Continuous Bag of Words, and Word Context Matrix, in a standardized framework. In addition, it uses PyTorch as its backend for neural network training.
We have implemented a module that adapts existing intrinsic static word embedding evaluation tasks for word similarity and word categorization to a streaming setting. Finally, we compare the implemented methods with different hyperparameter settings and discuss the results.
Our open-source library is available at https://github.com/dccuchile/rivertext.
Conversion rate (CVR) estimation aims to predict the probability of conversion event after a user has clicked an ad. Typically, online publisher has user browsing interests and click feedbacks, while demand-side advertising platform collects users’ post-click behaviors such as dwell time and conversion decisions. To estimate CVR accurately and protect data privacy better, vertical federated learning (vFL) is a natural solution to combine two sides’ advantages for training models, without exchanging raw data. Both CVR estimation and applied vFL algorithms have attracted increasing research attentions. However, standardized and systematical evaluations are missing: due to the lack of standardized datasets, existing studies adopt public datasets to simulate a vFL setting via hand-crafted feature partition, which brings challenges to fair comparison. We introduce FedAds, the first benchmark for CVR estimation with vFL, to facilitate standardized and systematical evaluations for vFL algorithms. It contains a large-scale real world dataset collected from Alibaba’s advertising platform, as well as systematical evaluations for both effectiveness and privacy aspects of various vFL algorithms. Besides, we also explore to incorporate unaligned data in vFL to improve effectiveness, and develop perturbation operations to protect privacy well. We hope that future research work in vFL and CVR estimation benefits from the FedAds benchmark.
The IARPA BETTER (Better Extraction from Text Through Enhanced Retrieval) program held three evaluations of information retrieval (IR) and information extraction (IE). For both tasks, the only training data available was in English, but systems had to perform cross-language retrieval and extraction from Arabic, Farsi, Chinese, Russian, and Korean. Pooled assessment and information extraction annotation were used to create reusable IR test collections. These datasets are freely available to researchers working in cross-language retrieval, information extraction, or the conjunction of IR and IE. This paper describes the datasets, how they were constructed, and how they might be used by researchers.
A number of datasets for Relation Extraction (RE) have been created to aide downstream tasks such as information retrieval, semantic search, question answering and textual entailment. However, these datasets fail to capture financial-domain specific challenges since most of these datasets are compiled using general knowledge sources such as Wikipedia, web-based text and news articles, hindering real-life progress and adoption within the financial world. To address this limitation, we propose REFinD, the first large-scale annotated dataset of relations, with ~29K instances and 22 relations amongst 8 types of entity pairs, generated entirely over financial documents. We also provide an empirical evaluation with various state-of-the-art models as benchmarks for the RE task and highlight the challenges posed by our dataset. We observed that various state-of-the-art deep learning models struggle with numeric inference, relational and directional ambiguity. To encourage further research in this direction, REFinD is available at https://www.jpmorgan.com/technology/artificial-intelligence/initiatives/refind-dataset/problem-motivation-outcome.
Linked-DocRED – Enhancing DocRED with Entity-Linking to Evaluate End-To-End Document-Level Information Extraction Pipelines
Information Extraction (IE) pipelines aim to extract meaningful entities and relations from documents and structure them into a knowledge graph that can then be used in downstream applications. Training and evaluating such pipelines requires a dataset annotated with entities, coreferences, relations, and entity-linking. However, existing datasets either lack entity-linking labels, are too small, not diverse enough, or automatically annotated (that is, without a strong guarantee of the correction of annotations). Therefore, we propose Linked-DocRED, to the best of our knowledge, the first manually-annotated, large-scale, document-level IE dataset. We enhance the existing and widely-used DocRED dataset with entity-linking labels that are generated thanks to a semi-automatic process that guarantees high-quality annotations. In particular, we use hyperlinks in Wikipedia articles to provide disambiguation candidates. We also propose a complete framework of metrics to benchmark end-to-end IE pipelines, and we define an entity-centric metric to evaluate entity-linking. The evaluation of a baseline shows promising results while highlighting the challenges of an end-to-end IE pipeline. Linked-DocRED, the source code for the entity-linking, the baseline, and the metrics are distributed under an open-source license and can be downloaded from a public repository.
The Conversational Search (CS) paradigm allows for an intuitive interaction between the user and the system through natural language sentences and it is increasingly being adopted in various scenarios. However, its widespread experimentation has led to the birth of a multitude of conversational search systems with custom implementations and variants of information retrieval models. This exacerbates the reproducibility crisis already observed in several research areas, including Information Retrieval (IR). To address this issue, we propose DECAF: a modular and extensible conversational search framework designed for fast prototyping and development of conversational agents. Our framework integrates all the components that characterize a modern conversational search system and allows for the seamless integration of Machine Learning (ML) and Large Language Models (LLMs)-based techniques. Furthermore, thanks to its uniform interface, DECAF allows for experiments characterized by a high degree of reproducibility. DECAF contains several state-of-the-art components including query rewriting, search functions under BoW and dense paradigms, and re-ranking functions. Our framework is tested on two well-known conversational collections: TREC CAsT 2019 and TREC CAsT 2020 and the results can be used by future practitioners as baselines. Our contributions include the identification of a series of state-of-the-art components for the conversational search task and the definition of a modular framework for its implementation.
LongEval-Retrieval is a Web document retrieval benchmark that focuses on continuous retrieval evaluation. This test collection is intended to be used to study the temporal persistence of Information Retrieval systems and will be used as the test collection in the Longitudinal Evaluation of Model Performance Track (LongEval) at CLEF 2023. This benchmark simulates an evolving information system environment – such as the one a Web search engine operates in – where the document collection, the query distribution, and relevance all move continuously, while following the Cranfield paradigm for offline evaluation. To do that, we introduce the concept of a dynamic test collection that is composed of successive sub-collections each representing the state of an information system at a given time step. In LongEval-Retrieval, each sub-collection contains a set of queries, documents, and soft relevance assessments built from click models. The data comes from Qwant, a privacy-preserving Web search engine that primarily focuses on the French market. LongEval-Retrieval also provides a ‘mirror’ collection: it is initially constructed in the French language to benefit from the majority of Qwant’s traffic, before being translated to English. This paper presents the creation process of LongEval-Retrieval and provides baseline runs and analysis.
SESSION: Demo Papers
Vector quantization is one of the critical techniques which enables dense retrieval for realtime applications. The recent study shows that vanilla vector quantization methods, like those implemented by FAISS , are lossy and prone to limited retrieval performances when large acceleration ratios are needed [14, 16, 18]. Besides, there have also been multiple algorithms which make the retriever and VQ better collaborated to alleviate such a loss. On top of these progresses, we develop LibVQ, which optimizes vector quantization for efficient dense retrieval. Our toolkit is highlighted for three advantages. 1. Effectiveness. The retrieval quality can be substantially improved over the vanilla implementations of VQ. 2. Simplicity. The optimization can be conducted in a lowcode fashion, and the optimization results can be easily loaded to ANN indexes to support downstream applications. 3. Universality. The optimization is agnostic to the embedding’s learning process, and may accommodate different input conditions and ANN back-ends with little modification of the workflow. LibVQ may also support rich applications beyond dense retrieval, e.g., embedding compression, topic modeling, and de-duplication. In this demo, we provide comprehensive hand-on examples and evaluations for LibVQ. The toolkit is publicly released at: https://github.com/staoxiao/LibVQ/tree/demo.
Preference judgments have been established as an effective method for offline evaluation of information retrieval systems with advantages to graded or binary relevance judgments. Graded judgments assign each document a pre-defined grade level, while preference judgments involve assessing a pair of items presented side by side and indicating which is better. However, leveraging preference judgments may require a more extensive number of judgments, and there are limitations in terms of evaluation measures. In this study, we present a new preference judgment tool called JUDGO, designed for expert assessors and researchers. The tool is supported by a new heap-like preference judgment algorithm that assumes transitivity and allows for ties. An earlier version of the tool was employed by NIST to determine up to the top-10 best items for each of the 38 topics for the TREC 2022 Health Misinformation track, with over 2,200 judgments collected. The current version has been applied in a separate research study to collect almost 10,000 judgments, with multiple assessors completing each topic. The code and resources are available at https://judgo-system.github.io.
The objective of High-Recall Information Retrieval (HRIR) is to retrieve as many relevant documents as possible for a given search topic. One approach to HRIR is Technology-Assisted Review (TAR), which uses information retrieval and machine learning techniques to aid the review of large document collections. TAR systems are commonly used in legal eDiscovery and systematic literature reviews. Successful TAR systems are able to find the majority of relevant documents using the least number of assessments. Commonly used retrospective evaluation assumes that the system achieves a specific, fixed recall level first, and then measures the precision or work saved (e.g., precision at r% recall). This approach can cause problems related to understanding the behaviour of evaluation measures in a fixed recall setting. It is also problematic when estimating time and money savings during technology-assisted reviews.
This paper presents a new visual analytics tool to explore the dynamics of evaluation measures depending on recall level. We implemented 18 evaluation measures based on the confusion matrix terms, both from general IR tasks and specific to TAR. The tool allows for a comparison of the behaviour of these measures in a fixed recall evaluation setting. It can also simulate savings in time and money and a count of manual vs automatic assessments for different datasets depending on the model quality. The tool is open-source, and the demo is available under the following URL: https://vombat.streamlit.app.
Mathematical notation is a key analytical resource for science and technology. Unfortunately, current math-aware search engines require LATEX or template palettes to construct formulas, which can be challenging for non-experts. Also, their indexed collections are primarily web pages where formulas are represented explicitly in machine-readable formats (e.g., LATEX, Presentation MathML). The new MathDeck system searches PDF documents in a portion of the ACL Anthology using both formulas and text, and shows matched words and formulas along with other extracted formulas in-context. In PDF, formulas are not demarcated: a new indexing module extracts formulas using PDF vector graphics information and computer vision techniques. For non-expert users and visual editing, a central design feature of MathDeck’s interface is formula ‘chips’ usable in formula creation, search, reuse, and annotation with titles and descriptions in cards. For experts, LATEX is supported in the text query box and the visual formula editor. MathDeck is open-source, and our demo is available online.
In this paper, we offer a new visualization tool — Dataset Statistical View (DSV), to lower the barrier of research entry by providing easy access to the question-answering (QA) datasets that researchers can build their work upon. Our target users are new researchers to the QA domain with no prior knowledge nor programming skills. The system is populated with multiple QA datasets, which covers a wide range of QA tasks. It allows researchers to explore and compare existing QA datasets at a one-stop website. The system shows statistical graphs for each QA dataset to offer an overview and a visual comparison between datasets. Although this paper focuses mainly at the syntactic level comparison, integrating bias and semantic level analysis is our ongoing work. We believe our DSV system is a valuable contribution to the advancement of the QA field, as it provides a solid starting point for new researchers and practitioners. An overview of the framework is demonstrated in this paper and the introduction of the application system is available at https://cnchuy.github.io/images/demo.mp4.
Recent rapid advances in deep pre-trained language models and the introduction of large datasets have powered research in embedding-based neural retrieval. While many excellent research papers have emerged, most of them come with their own implementations, which are typically optimized for some particular research goals instead of efficiency or code organization. In this paper, we introduce Tevatron, a neural retrieval toolkit that is optimized for efficiency, flexibility, and code simplicity. Tevatron enables model training and evaluation for a variety of ranking components such as dense retrievers, sparse retrievers, and rerankers. It also provides a standardized pipeline that includes text processing, model training, corpus/query encoding, and search. In addition, Tevatron incorporates well-studied methods for improving retriever effectiveness such as hard negative mining and knowledge distillation. We provide an overview of Tevatron in this paper, demonstrating its effectiveness and efficiency on multiple IR and QA datasets. We highlight Tevatron’s flexible design, which enables easy generalization across datasets, model architectures, and accelerator platforms (GPUs and TPUs). Overall, we believe that Tevatron can serve as a solid software foundation for research on neural retrieval systems, including their design, modeling, and optimization.
Efficiently retrieving the top-k documents for a given query is a fundamental operation in many search applications. Dynamic pruning algorithms accelerate top-k retrieval over inverted indexes by skipping documents that are not able to enter the current set of results. However, the performance of these algorithms depends on a number of variables such as the ranking function, the order of documents within the index, and the number of documents to be retrieved. In this paper, we propose a diagnostic framework, Dyno, for profiling and visualizing the performance of dynamic pruning algorithms. Our framework captures processing traces during retrieval, allowing the operations of the index traversal algorithm to be visualized. These visualizations support both query-level and system-to-system comparisons, enabling performance characteristics to be readily understood for different systems. Dyno benefits both academics and practitioners by furthering our understanding of the behavior of dynamic pruning algorithms, allowing better design choices to be made during experimentation and deployment.
MetroScope: An Advanced System for Real-Time Detection and Analysis of Metro-Related Threats and Events via Twitter
Metro systems are vital to our daily lives, but they face safety or reliability challenges, such as criminal activities or infrastructure disruptions, respectively. Real-time threat detection and analysis are crucial to ensure their safety and reliability. Although many existing systems use Twitter to detect metro-related threats or events in real-time, they have limitations in event analysis and system maintenance. Specifically, they cannot analyze event development, or prioritize events from numerous tweets. Besides, their users are required to continuously monitor system notifications, use inefficient content retrieval methods, and perform detailed system maintenance. We addressed those issues by developing the MetroScope system, a real-time threat/event detection system applied to Washington D.C. metro system. MetroScope can automatically analyze event development, prioritize events based on urgency, send emergency notifications via emails, provide efficient content retrieval, and self-maintain the system. Our MetroScope system is now available at http://orion.nvc.cs.vt.edu:5000/, with a video (https://www.youtube.com/watch?v=vKIK9M60-J8) introducing its features and instructions. MetroScope is a significant advancement in enhancing the safety and reliability of metro systems.
A challenge for search technologies is to support scientific literature surveys that present overviews of the reported numerical values documented for specific physical properties. We present SciHarvester, a system tailored to address this problem for agronomic science. It provides an interface to search PubAg documents, allowing complex queries involving restrictions on numerical values. SciHarvester identifies relevant documents and generates overview of reported parameter values. The system allows interrogation of the results to explain the system’s performance. Our evaluations demonstrate the promise of incorporating information extraction techniques with the use of neural scoring mechanisms.
Since the dynamic characteristics of knowledge graphs, many inductive knowledge graph representation learning (KGRL) works have been proposed in recent years, focusing on enabling prediction over new entities. NeuralKG-ind is the first library of inductive KGRL as an important update of NeuralKG library. It includes standardized processes, rich existing methods, decoupled modules, and comprehensive evaluation metrics. With NeuralKG-ind, it is easy for researchers and engineers to reproduce, redevelop, and compare inductive KGRL methods. The library, experimental methodologies, and model re-implementing results of NeuralKG-ind are all publicly released at https://github.com/zjukg/NeuralKG/tree/ind https://github.com/zjukg/NeuralKG/tree/ind.
The increasing popularity of social media promotes the proliferation of misinformation, especially in the communities of Chinese-speaking diasporas, which has caused significant negative societal impacts. In addition, most of the existing efforts on misinformation mitigation have focused on English and other western languages, which makes numerous overseas Chinese a very vulnerable population to online disinformation campaigns. In this paper, we present AMICA, an information retrieval system for alleviating misinformation for Chinese Americans. AMICA dynamically collects data from popular social media platforms for Chinese Americans, including WeChat, Twitter, YouTube, and Chinese forums. The data are stored and indexed in Elasticsearch to provide advanced search functionalities. Given a user query, the ranking of social media posts considers both topical relevance and the likelihood of being misinformation.
In this paper, we propose “Petition Executing Process Optimizer (PEPO),” an AI-based petition processing system that features three components, (a) Department Classification, (b) Importance Assessment, and (c) Response Generation for improving the Public Work Bureau (PWB) 1999 Hotline petitions handling process in Taiwan. Our Department Classification algorithm has been evaluated with NDCG, achieving an impressive score of 86.48%, while the Important Assessment function has an accuracy rate of 85%. Besides, Response Generation enhances communication efficiency between the government and citizens. The PEPO system has been deployed as an online web service for the Public Works Bureau of the Tainan City Government. With PEPO, the PWB benefits greatly from the effectiveness and efficiency of handling citizens’ petitions.
Community search, which looks for query-dependent communities in a graph, is an important task in graph analysis. Existing community search studies address the problem by finding a densely-connected subgraph containing the query. However, many real-world networks are heterogeneous with rich semantics. Queries in heterogeneous networks generally involve in multiple communities with different semantic connections, while returning a single community with mixed semantics has limited applications. In this paper, we revisit the community search problem on heterogeneous networks and introduce a novel paradigm of heterogeneous community search and ranking. We propose to automatically discover the query semantics to enable the search of different semantic communities and develop a comprehensive community evaluation model to support the ranking of results. We build HeteroCS, a heterogeneous community search system with semantic explanation, upon our semantic community model, and deploy it on two real-world graphs. We present a demonstration case to illustrate the novelty and effectiveness of the system.
Pre-trained language models (PLMs) have emerged as the foundation of the most advanced Information Retrieval (IR) models. Powered by PLMs, the latest IR research has proposed novel models, new domain adaptation algorithms as well as enlarged datasets. In this paper, we present a Python-based IR toolkit OpenMatch-v2. As a full upgrade of OpenMatch proposed in 2021, OpenMatch-v2 incorporates the most recent advancements of PLM-based IR research, providing support for new, cross-modality models and enhanced domain adaptation techniques with a streamlined, optimized infrastructure. The code of OpenMatch is publicly available at https://github.com/OpenMatch/OpenMatch.
Modern user profiling approaches capture different forms of interactions with the data, from user-item to user-user relationships. Graph Neural Networks (GNNs) have become a natural way to model these behaviours and build efficient and effective user profiles. However, each GNN-based user profiling approach has its own way of processing information, thus creating heterogeneity that does not favour the benchmarking of these techniques. To overcome this issue, we present FairUP, a framework that standardises the input needed to run three state-of-the-art GNN-based models for user profiling tasks. Moreover, given the importance that algorithmic fairness is getting in the evaluation of machine learning systems, FairUP includes two additional components to (1) analyse pre-processing and post-processing fairness and (2) mitigate the potential presence of unfairness in the original datasets through three pre-processing debiasing techniques. The framework, while extensible in multiple directions, in its first version, allows the user to conduct experiments on four real-world datasets. The source code is available at https://link.erasmopurif.com/FairUP-source-code, and the web application is available at https://link.erasmopurif.com/FairUP.
Over the past years, notable progress has been made towards fighting misinformation spread over social media, encouraging the development of many fact-checking systems. However, systems that operate over Arabic content are scarce. In this work, we bridge this gap by proposing Tahaqqaq (Verify), an Arabic real-time system that helps users verify claims over Twitter with several functionalities, such as identifying check-worthy claims, estimating credibility of users in terms of spreading fake news, and finding authoritative accounts. Tahaqqaq has a friendly online Web interface that supports various real-time user scenarios. In the same breath, we enable public access to Tahaqqaq services through a handy RESTful API. Finally, in terms of performance, multiple components of Tahaqqaq outperform the state-of-the-art models on Arabic datasets.
Entity alignment (EA) aims to find equivalent entities in different knowledge graphs (KGs). State-of-the-art EA approaches generally use Graph Neural Networks (GNNs) to encode entities. However, most of them train the models and evaluate the results in a full-batch fashion, which prohibits EA from being scalable on large-scale datasets. To enhance the usability of GNN-based EA models in real-world applications, we present SEA, a scalable entity alignment system that enables to (i) train large-scale GNNs for EA, (ii) speed up the normalization and the evaluation process, and (iii) report clear results for users to estimate different models and parameter settings. SEA can be run on a computer with merely one graphic card. Moreover, SEA encompasses six state-of-the-art EA models and provides access for users to quickly establish and evaluate their own models. Thus, SEA allows users to perform EA without being involved in tedious implementations, such as negative sampling and GPU-accelerated evaluation. With SEA, users can gain a clear view of the model performance. In the demonstration, we show that SEA is user-friendly and is of high scalability even on computers with limited computational resources.
Attractive images or videos are the visual backbones of journalism and social media to gain the user’s attention. From trailers to teaser images to image galleries, appealing visuals have only grown in importance over the years. However, selecting eye-catching shots from a video or the perfect image from large image collections is a challenging and time-consuming task. We present our tool that can assess image and video content from an aesthetic standpoint. We discovered that it is possible to perform such an assessment by combining expert knowledge with data-driven information. We combine the relevant aesthetic features and machine learning algorithms into an aesthetics retrieval system, which enables users to sort uploaded visuals based on an aesthetic score and interact with additional photographic, cinematic, and person-specific features.
During past years, several frameworks for (Neural) Information Retrieval have been proposed. However, while they allow reproducing already published results, it is still very hard to re-use some parts of the learning pipelines, such as for instance the pre-training, sampling strategy, or a loss in newly developed models. It is also difficult to use new training techniques with old models, which makes it more difficult to assess the usefulness of ideas on various neural IR models. This slows the adoption of new techniques, and in turn, the development of the IR field. In this paper, we present XpmIR, a Python library defining a reusable set of experimental components. The library already contains state-of-the-art models and indexation techniques and is integrated with the HuggingFace hub.
Undertaking research in domain-specific scenarios such as systematic review literature search, legal search, and patent search can often have a high barrier of entry due to complicated indexing procedures and complex Boolean query syntax. Indexing and searching document collections like PubMed in off-the-shelf tools such as Elasticsearch and Lucene often yields less accurate (and less effective) results than the PubMed search engine, i.e., retrieval results do not match what would be retrieved if one issued the same query to PubMed. Furthermore, off-the-shelf tools have their own nuanced query languages and do not allow directly using the often large and complicated Boolean queries seen in domain-specific search scenarios. The pybool_ir toolkit aims to address these problems and to lower the barrier to entry for developing new methods for domain-specific search. The toolkit is an open source package available at https://github.com/hscells/pybool_ir.
Video analysis platforms that integrate automatic solutions for multimedia and information retrieval enable various applications in many disciplines including film and media studies, communication science, and education. However, current platforms for video analysis either focus on manual annotations or include only a few tools for automatic content analysis. In this paper, we present a novel web-based video analysis platform called TIB AV-Analytics (TIB-AV-A). Unlike previous platforms, TIB-AV-A integrates state-of-the-art approaches in the fields of computer vision, audio analysis, and natural language processing for many relevant video analysis tasks. To facilitate future extensions and to ensure interoperability with existing tools, the video analysis approaches are implemented in a plugin structure with appropriate interfaces and import-export functions. TIB-AV-A leverages modern web technologies to provide users with a responsive and interactive web interface that enables manual annotation and provides access to powerful deep learning tools without a requirement for specific hardware dependencies. Source code and demo are publicly available at: https://service.tib.eu/tibava.
In this work, we present SONAR, a web-based tool for multimodal exploration of Non-Fungible Token (NFT) inspiration networks. SONAR is conceived to support both creators and traders in the emerging Web3 by providing an interactive visualization of the inspiration-driven connections between NFTs, at both individual level and collection level. SONAR can hence be useful to identify new investment opportunities as well as anomalous inspirations. To demonstrate SONAR’s capabilities, we present an application to the largest and most representative dataset concerning the NFT landscape to date, showing how our proposed tool can scale and ensure high-level user experience up to millions of edges.
This work presents CoreKB, a Web platform for searching reliable facts over gene expression-cancer associations Knowledge Base (KB). It provides search capabilities over an RDF graph using natural language queries, structured facets, and autocomplete. CoreKB is designed to be intuitive and easy to use for healthcare professionals, medical researchers, and clinicians. The system offers the user a comprehensive overview of the scientific evidence supporting a medical fact. It provides a quantitative comparison between the possible gene-cancer associations a particular fact can reflect.
ranxhub is an online repository for sharing artifacts deriving from the evaluation of Information Retrieval systems. Specifically, we provide a platform for sharing pre-computed runs: the ranked lists of documents retrieved for a specific set of queries by a retrieval model. We also extend ranx, a Python library for the evaluation and comparison of Information Retrieval runs, adding functionalities to integrate the usage of ranxhub seamlessly, allowing the user to compare the results of multiple systems in just a few lines of code. In this paper, we first outline the many advantages and implications that an online repository for sharing runs can bring to the table. Then, we introduce ranxhub and its integration with ranx, showing its very simple usage. Finally, we discuss some use cases for which ranxhub can be highly valuable for the research community.
Podcasts are spoken documents that, in recent years, have gained widespread popularity. Despite the growing research interest in this domain, conducting user studies remains challenging due to the lack of datasets that include user behaviour. In particular, there is a need for a podcast streaming platform that reduces the overhead of conducting user studies. To address these issues, in this work, we present Podify. It is the first web-based platform for podcast streaming and consumption specifically designed for research. The platform highly resembles existing streaming systems to provide users with a high level of familiarity on both desktop and mobile. A catalogue of podcast episodes can be easily created via RSS feeds. The platform also offers Elasticsearch-based indexing and search that is highly customisable, allowing research and experimentation in podcast search. Users can manually curate playlists of podcast episodes for consumption. With mechanisms to collect explicit feedback from users (i.e., liking and disliking behaviour), Podify also automatically collects implicit feedback (i.e., all user interactions). Users’ behaviour can be easily exported to a readable format for subsequent experimental analysis. A demonstration of the platform is available at https://youtu.be/k9Z5w_KKHr8, with the code and documentation available at https://github.com/NeuraSearch/Podify.
This paper introduces a novel visualization tool that facilitates the exploratory analysis of continuous evaluation for information retrieval systems. We base our analysis on score standardization and meta-analysis techniques applied to Information Retrieval evaluation. We present three functionalities: evaluation overview, delta evaluation, and meta-analysis applied to three perspectives: evaluation rounds, queries, and systems. To illustrate the use of the tool, we provide an example using the TREC-COVID test collection.
SESSION: SIRIP Papers
Multi-lingual Semantic Search for Domain-specific Applications: Adobe Photoshop and Illustrator Help Search
Search has become an integral part of Adobe products and users rely on it to learn about tool usage, shortcuts, quick links, and ways to add creative effects and to find assets such as backgrounds, templates, and fonts. Within applications such as Photoshop and Illustrator, users express domain-specific search intents via short text queries. In this work, we leverage sentence-BERT models fine-tuned on Adobe’s HelpX data to perform multi-lingual semantic search on help and tutorial documents. We used behavioral data (queries, clicks, and impressions) and additional annotated data to train several BERT-based models for scoring query-document pairs for semantic similarity. We benchmarked the keyword-based production system against semantic search. Subsequent AB tests demonstrate that this approach improves engagement for longer queries while reducing null results significantly.
Instant search systems present results to the user at every keystroke. This type of search system works best when the query ambiguity is low, the catalog is limited, and users know what they are looking for. However, Spotify’s catalog is large and diverse, leading some users to struggle when formulating search intents. Query suggestions can be a powerful tool that helps users to express intents and explore content from the long-tail of the catalog. In this paper, we explain how we introduce query suggestions in Spotify’s instant search system–a system that connects hundreds of millions of users with billions of items in our audio catalog. Specifically, we describe how we: (1) generate query suggestions from instant search logs, which largely contains in-complete prefix queries that cannot be directly applied as suggestions; (2) experiment with the generated suggestions in a specific UI feature, Related Searches; and (3) develop new metrics to measure whether the feature helps users to express search intent and formulate exploratory queries.
Aiming at helping users locally discover retail services (e.g., entertainment and dining) on Online to Offline (O2O) service platforms, we propose COUPA, an industrial system targeting for characterizing user preference with inspiring considerations of time and position aware preferences. We carefully implement and deploy COUPA in Alipay with a cooperation of edge, streaming and batch computing, as well as a two-stage online serving mode, to support several popular recommendation scenarios. Extensive experiments reveal the superior performance of COUPA for recommendation.
In the ride-hailing business, compensation is mostly used to motivate consumers to place more orders and grow the market scale. However, most of the previous studies focus on car-hailing services. Few works investigate localized smart transportation innovations, such as intra-city freight logistics and designated driving. In addition, satisfying consumer fairness and improving consumer surplus, with the objective of maximizing revenue, are also important. In this paper, we propose a consumer compensation system, where a transfer learning enhanced uplift modeling is designed to measure the elasticity, and a model predictive control based optimization is formulated to control the budget accurately. Our implementation is effective and can keep the online environment lightweight. The proposed system has been deployed in the production environment of the real-world ride-hailing platform for 300 days, which outperforms the expert strategy by using 0.5% less subsidy and achieving 14.4% more revenue.
As the largest local retail & instant delivery platform in China, Meituan Waimai has deployed a personalized recommender system on server and recommend nearby stores to users through APP homepage. To capture real-time intention of users and flexibly adjust the recommendation results on the homepage, we further add an interactive recommender system. The existing interactive recommender systems in the industry mainly capture intention of users based on their feedback on a specific UI of questions. However, we find that it will undermine use fluency and increase use complexity by rashly inserting a new question UI when users browse the homepage. Therefore, we develop an Embedded Interactive Recommender System (EIRS) that directly infers users’ intention according to their click behaviors on the homepage and dynamically inserts a new recommendation result into the homepage1. To demonstrate the effectiveness of EIRS, we conduct systematic online A/B Tests, where click-through & conversion rate of the inserted EIRS result is 132% higher than that of the initial result on the homepage, and the overall gross merchandise volume is effectively enhanced by 0.43%.
Integrity and Junkiness Failure Handling for Embedding-based Retrieval: A Case Study in Social Network Search
Embedding based retrieval has seen its usage in a variety of search applications like e-commerce, social networking search etc. While the approach has demonstrated its efficacy in tasks like semantic matching and contextual search, it is plagued by the problem of uncontrollable relevance. In this paper, we conduct an analysis of embedding-based retrieval launched in early 2021 on our social network search engine, and define two main categories of failures introduced by it, integrity and junkiness. The former refers to issues such as hate speech and offensive content that can severely harm user experience, while the latter includes irrelevant results like fuzzy text matching or language mismatches. Efficient methods during model inference are further proposed to resolve the issue, including indexing treatments and targeted user cohort treatments, etc. Though being simple, we show the methods have good offline NDCG and online A/B tests metrics gain in practice. We analyze the reasons for the improvements, pointing out that our methods are only preliminary attempts to this important but challenging problem. We put forward potential future directions to explore.
End-to-end generation-based approaches have been investigated and applied in task-oriented dialogue systems. However, in industrial scenarios, existing methods face the bottlenecks of reliability (e.g., domain-inconsistent responses, repetition problem, etc) and efficiency (e.g., long computation time, etc). In this paper, we propose a task-oriented dialogue system via action-level generation. Specifically, we first construct dialogue actions from large-scale dialogues and represent each natural language (NL) response as a sequence of dialogue actions. Further, we train a Sequence-to-Sequence model which takes the dialogue history as the input and outputs a sequence of dialogue actions. The generated dialogue actions are transformed into verbal responses. Experimental results show that our light-weighted method achieves competitive performance, and has the advantage of reliability and efficiency.
Understanding customer requirements is a key success factor for both business-to-consumer (B2C) and business-to-business (B2B) enterprises. In a B2C context, most requirements are directly related to products and therefore expressed in keyword-based queries. In comparison, B2B requirements contain more information about customer needs and as such the queries are often in a longer form. Such long-form queries pose significant challenges to the information retrieval task in B2B context. In this work, we address the long-form information retrieval challenges by proposing a combination of (i) traditional retrieval methods, to leverage the lexical match from the query, and (ii) state-of-the-art sentence transformers, to capture the rich context in the long queries. We compare our method against traditional TF-IDF and BM25 models on an internal dataset of 12,368 pairs of long-form requirements and products sold. The evaluation shows promising results and provides directions for future work.
The embedding index has become an essential part of the dense retrieval (DR) system, which enables a fast search for billion of items in online E-commerce applications. To accelerate the retrieval process in industrial scenarios, most of the previous studies only utilize item embeddings. However, the product quantization process without query embeddings will lead to inconsistency between queries and items. A straightforward solution is to put query embedding into the product quantization process. But we found that the distance of the positive query and item embedding pairs is too large, which means the query and item embeddings learned by the two-tower are not fully aligned. This problem would lead to performance decay when directly putting query embeddings into the product quantization.
In this paper, we propose a novel query-aware embedding Index framework, which aligns the query and item embedding space to reduce the distance between positive pairs, thereby mixing the query and item embeddings to learn better cluster centers for product quantization. Specifically, we first propose s symmetric loss to train a better two-tower to achieve space alignment. Subsequently, we propose a mixed quantization strategy to put the query embeddings into the product quantization process for bridging the gap between queries and compressed item embeddings. Extensive experiments show that our framework significantly outperforms previous models on a real-world dataset, which demonstrates the superiority and effectiveness of the framework.
Online allocation is a critical challenge in constrained recommendation systems, where the distribution of goods, ads, vouchers, and other content to users with limited resources needs to be managed effectively. While the existing literature has made significant progress in improving recommendation algorithms for various scenarios, less attention has been given to developing and deploying industry-scale online allocation system in an efficient manner. To address this issue, this paper introduces an integrated and efficient learning framework in constrained recommendation scenarios at Alipay. The framework has been tested through experiments, demonstrating its superiority over other state-of-the-art methods.
Image retrieval system (IRS) is commonly used in E-Commerce platforms for a wide range of applications such as price comparison and commodity recommendation. However, customers may experience inconsistent retrieval problems. Although the retrieved image contains the query object, the main product of the retrieved image is not associated with the query product. This is caused by the wrong product instance location when building the product image retrieval library. We can easily determine which product is on sale through the hint of the title, so we propose Text-Guided MuliModal Product Location (TMML) to use additional product titles to assist in locating the actual selling product instance. We design a weakly-aligned region-text data collection method to generate region-text pseudo-label by utilizing the IRS and user behavior from the E-commerce platform. To mitigate the impact of data noise, we propose a Mutual-Aware Contrastive Loss. Our results show that the proposed TMML outperforms the state-of-the-art method GLIP  by 3.95% in top-1 precision on our multi-objects test set, and 2.53% error located images in AliExpress has been corrected, which greatly alleviates the retrieval inconsistencies in IRS.
In recent years, many methods have been proposed to alleviate the biases in recommender systems by combining biased data and unbiased data. Among these methods, data imputation method is effective, but previous works only employ a straightforward model to generate imputed data, which can not fully characterize the data. In this paper, we propose a novel data imputation approach that combines an unbiased model and a debiasing model with adaptively learnt weights. We conduct extensive experiments on two public recommendation datasets and one production dataset to demonstrate the effectiveness and robustness of the proposed method.
For many business applications that require the processing, indexing, and retrieval of professional documents such as legal briefs (in PDF format etc.), it is often essential to classify the pages of any given document into their corresponding types beforehand. Most existing studies in the field of document image classification either focus on single-page documents or treat multiple pages in a document independently. Although in recent years a few techniques have been proposed to exploit the context information from neighboring pages to enhance document page classification, they typically cannot be utilized with large pre-trained language models due to the constraint on input length. In this paper, we present a simple but effective approach that overcomes the above limitation. Specifically, we enhance the input with extra tokens carrying sequential information about previous pages — introducing recurrence — which enables the usage of pre-trained Transformer models like BERT for context-aware page classification. Our experiments conducted on two legal datasets in English and Portuguese respectively show that the proposed approach can significantly improve the performance of document page classification compared to the non-recurrent setup as well as the other context-aware baselines.
Facebook content search is a critical channel that enables people to discover the best content to deepen their engagement with friends and family, creators, and communities. Building a highly personalized search engine to serve billions of daily active users to find the best results from a large scale of candidates is a challenging task. The search engine must take multiple dimensions into consideration, including different content types, different query intents, and user social graph, etc. In this paper, we discuss the challenges of Facebook content search in depth, and then describe our novel approach to efficiently handling a massive number of documents with advanced query understanding, retrieval, and machine learning techniques. The proposed system has been fully verified and applied to the production system of Facebook Search, which serves billions of users.
One of the most challenging aspects of operating a large-scale web search engine is to accurately evaluate and monitor the search engine’s result quality regardless of search types. From a business perspective, in the face of such challenges, it is important to establish a universal search quality metric that can be easily understood by the entire organisation. In this paper, we introduce a model-based quality metric using Explainable Boosting Machine as the classifier and online user behaviour signals as features to predict search quality. The proposed metric takes into account a variety of search types and has good interpretability. To examine the performance of the metric, we constructed a large dataset of user behaviour on search engine results pages (SERPs) with SERP quality ratings from professional annotators. We compared the performance of the model in our metric to those of other black-box machine learning models on the dataset. We also share a few experiences within our company for the org-wide adoption of this metric relevant to metric design.
Performing prior art searches is an essential step in both patent drafting and invalidation. The task is challenging due to the large number of existing patent documents and the domain knowledge required to analyze the documents. We present a graph-based patent search engine that tries to mimic the work done by a professional patent examiner. Each patent document is converted to a graph that describes the parts of the invention and the relations between the parts. The search engine is powered by a graph neural network that learns to find prior art by using novelty citation data from patent office search reports where citations are compiled by human patent examiners. We show that a graph-based approach is an efficient way to perform searches on technical documents and demonstrate it in the context of patent searching.
The online performance of customer service bots is often less than satisfactory because of the gap between limited training data and real-world user questions. As a straightforward way to improve online performance, model iteration and re-deployment are time consuming and labor-intensive, and therefore difficult to sustain. To fix badcases and improve online performance of chatbots in a timely and continuous manner, we propose a data-centric solution consisting of three main modules: badcase detection, bad case correction, and answer extraction. By making full use of online model signals, implicit user feedback and artificial customer service log, the proposed solution can fix online badcases automatically. Our solution has been deployed and bringing consistently positive impacts for hundreds of customer service bots used by Alipay app.
In the multimedia era, image becomes an effective medium in search advertising. Dynamic Image Advertising (DIA), a system that matches queries with appropriate ad images and generates multimodal ads, is introduced to improve user experience and ad revenue. The core of DIA is a query-image matching module performing ad image retrieval and relevance modeling. Current query-image matching suffers from data scarcity and inconsistency, and insufficient cross-modal fusion. Also, the retrieval and relevance models are separately trained, affecting overall performance. In this paper, we propose a vision-language framework for query-image matching. It consists of two parts. First, we design a base model combining different encoders and tasks, and train it on large-scale image-text pairs to learn general multimodal representation. Then, we fine-tune the base model on advertising business data, unifying relevance modeling and retrieval through multi-objective learning. Our framework has been implemented in Baidu search advertising system “Phoneix Nest”. Online evaluation shows that it improves cost per mille (CPM) and click-through rate (CTR) by 1.04% and 1.865% on the system main traffic.
Query understanding plays a key role in exploring users’ search intents and facilitating users to locate their most desired information. However, it is inherently challenging since it needs to capture semantic information from short and ambiguous queries and often requires massive task-specific labeled data. In recent years, pre-trained language models (PLMs) have advanced various natural language processing tasks because they can extract general semantic information from large-scale corpora. However, directly applying them to query understanding is sub-optimal because existing strategies rarely consider to boost the search performance. On the other hand, search logs contain user clicks between queries and urls that provide rich users’ search behavioral information on queries beyond their content. Therefore, in this paper, we aim to fill this gap by exploring search logs. In particular, we propose a novel graph-enhanced pre-training framework, GE-BERT, which leverages both query content and the query graph. The model is trained on a query graph where nodes are queries and two queries are connected if they lead to clicks on the same urls, to capture both semantic information and users’ search behavioral information of queries. Extensive experiments on offline and online tasks have demonstrated the effectiveness of the proposed framework.
We present part of Huawei’s efforts in building a Product Knowledge Graph (PKG). We want to identify which product attributes (i.e. properties) are relevant and important in terms of shopping decisions to product categories (i.e. classes). This is particularly challenging when the attributes and their values are mined from online product catalogues, i.e. HTML pages. These web pages contain semi-structured data, which do not follow a concerted format and use diverse vocabulary to designate the same features. We propose a system for key attribute identification (KATIE) based on fine-tuning pre-trained models (e.g., DistilBERT) to predict the applicability and importance of an attribute to a category. We also propose an attribute synonyms identification module that allows us to discover synonymous attributes by considering not only their labels’ similarities but also the similarity of their values sets. We have evaluated our approach to Huawei categories taxonomy and a set of internally mined attributes from web pages. KATIE guarantees promising performance results compared to the most recent baselines.
A Transformer-Based Substitute Recommendation Model Incorporating Weakly Supervised Customer Behavior Data
The substitute-based recommendation is widely used in E-commerce to provide better alternatives to customers. However, existing research typically uses customer behavior signals like co-view and view-but-purchase-another to capture the substitute relationship. Despite its intuitive soundness, such an approach might ignore the functionality and characteristics of products. In this paper, we adapt substitute recommendations into language matching problem. It takes the product title description as model input to consider product functionality. We design a new transformation method to de-noise the signals derived from production data. In addition, we consider multilingual support from the engineering point of view. Our proposed end-to-end transformer-based model achieves both successes from offline and online experiments. The proposed model has been deployed in a large-scale E-commerce website for 11 marketplaces in 6 languages. Our proposed model is demonstrated to increase revenue by 19% based on an online A/B experiment.
Friend recommendation systems in online social and professional networks such as Snapchat helps users find friends and build connections, leading to better user engagement and retention. Traditional friend recommendation systems take advantage of the principle of locality and use graph traversal to retrieve friend candidates, e.g. Friends-of-Friends (FoF). While this approach has been adopted and shown efficacy in companies with large online networks such as Linkedin and Facebook, it suffers several challenges: (i) discrete graph traversal offers limited reach in cold-start settings, (ii) it is expensive and infeasible in realtime settings beyond 1 or 2 hop requests owing to latency constraints, and (iii) it cannot well-capture the complexity of graph topology or connection strengths, forcing one to resort to other mechanisms to rank and find top-K candidates. In this paper, we proposed a new Embedding Based Retrieval (EBR) system for retrieving friend candidates, which complements the traditional FoF retrieval by retrieving candidates beyond 2-hop, and providing a natural way to rank FoF candidates. Through online A/B test, we observe statistically significant improvements in the number of friendships made with EBR as an additional retrieval source in both low- and high-density network markets. Our contributions in this work include deploying a novel retrieval system to a large-scale friend recommendation system at Snapchat, generating embeddings for billions of users using Graph Neural Networks, and building EBR infrastructure in production to support Snapchat scale.
Modeling Spoken Information Queries for Virtual Assistants: Open Problems, Challenges and Opportunities
Virtual assistants are becoming increasingly important speech-driven Information Retrieval platforms that assist users with various tasks. We discuss open problems and challenges with respect to modeling spoken information queries for virtual assistants, and list opportunities where Information Retrieval methods and research can be applied to improve the quality of virtual assistant speech recognition. We discuss how query domain classification, knowledge graphs and user interaction data, and query personalization can be helpful to improve the accurate recognition of spoken information domain queries. Finally, we also provide a brief overview of current problems and challenges in speech recognition.
The personalized stock recommendation is a task to recommend suitable stocks for each investor. The personalized recommendations are valuable, especially in investment decision making as the objective of building a portfolio varies by each retail investor. In this paper, we propose a Personalized Stock Recommendation with Investors’ Attention and Contextual Information (PSRIC). PSRIC aims to incorporate investors’ financial decision-making process into a stock recommendation, and it consists of an investor modeling module and a context module. The investor modeling module models the investor’s attention toward various stock information. The context module incorporates stock dynamics and investor profiles. The result shows that the proposed model outperforms the baseline models and verifies the usefulness of both modules in ablation studies.
The complexity of industry-grade event-based datalakes grows dynamically each passing hour. Companies actively gather behavioral information on their customers, recording multiple types of events, such as clicks, likes, page views, card transactions, add-to-basket, or purchase events. In response to this, the Synerise Monad platform has been proposed. The primary focus of Monad is to produce Universal Behavioral Representations (UBRs) – large vectors encapsulating the behavioral patterns of each user. UBRs do not lose knowledge about individual events, in contrast to aggregated features or averaged embeddings. They are based on award-winning algorithms developed at Synerise – Cleora and EMDE – and allow to process real-life datasets composed of billions of events in record time. In this paper, we introduce a new aspect of Monad: private foundation models for behavioral data, trained on top of UBRs. The foundation models are trained in purely self-supervised manner and allow to exploit general knowledge about human behavior, which proves especially useful when multiple downstream models must be trained and time constraints are tight, or when labeled data is scarce. Experimental results show that the Monad foundation models can cut training time in half and require 3x less data to reach optimal results, often achieving state-of-the-art results.
Accurate Named Entity Recognition (NER) is crucial for various information retrieval tasks in industry. However, despite significant progress in traditional NER methods, the extraction of Complex Named Entities remains a relatively unexplored area. In this paper, we propose a novel system that combines object detection for Document Layout Analysis (DLA) with weakly supervised learning to address the challenge of extracting discontinuous complex named entities in legal documents. Notably, to the best of our knowledge, this is the first work to apply weak supervision to DLA. Our experimental results show that the model trained solely on pseudo labels outperforms the supervised baseline when gold-standard data is limited, highlighting the effectiveness of our proposed approach in reducing the dependency on annotated data.
Online Food Recommendation Service (OFRS) has remarkable spatiotemporal characteristics and the advantage of being able to conveniently satisfy users’ needs in a timely manner. There have been a variety of studies that have begun to explore its spatiotemporal properties, but a comprehensive and in-depth analysis of the OFRS spatiotemporal features is yet to be conducted. Therefore, this paper studies the OFRS based on three questions: how spatiotemporal features play a role; why self-attention cannot be used to model the spatiotemporal sequences of OFRS; and how to combine spatiotemporal features to improve the efficiency of OFRS. Firstly, through experimental analysis, we systemically extracted the spatiotemporal features of OFRS, identified the most valuable features and designed an effective combination method. Secondly, we conducted a detailed analysis of the spatiotemporal sequences, which revealed the shortcomings of self-attention in OFRS, and proposed a more optimized spatiotemporal sequence method for replacing self-attention. In addition, we also designed a Dynamic Context Adaptation Model to further improve the efficiency and performance of OFRS. Through the offline experiments on two large datasets and online experiments for a week, the feasibility and superiority of our model were proven.
In marketing recommendations, the campaign organizers will distribute coupons to users to encourage consumption. In general, a series of strategies are employed to interfere with the coupon distribution process, leading to a growing imbalance between user-coupon interactions, resulting in a bias in the estimation of conversion probabilities. We refer to the estimation bias as the matching bias. In this paper, we explore how to alleviate the matching bias from the causal-effect perspective. We regard the historical distributions of users and coupons over each other as confounders and characterize the matching bias as a confounding effect to reveal and eliminate the spurious correlations between user-coupon representations and conversion probabilities. Then we propose a new training paradigm named De-Matching Bias Recommendation (DMBR) to remove the confounding effects during model training via the backdoor adjustment. We instantiate DMBR on two representative models: DNN and MMOE, and conduct extensive offline and online experiments to demonstrate the effectiveness of our proposed paradigm.
Transformer-based models have achieved tremendous success in sequential recommendation (SR), but they suffer from consuming excessive computational resources, particularly in the inference stage. Thus, developing lightweight yet effective SR models has become a frequent demand in industrial applications, which is also in line with the ideals of Green AI and Green IR. In this applied paper, we introduce GreenSeq deployed in Alipay to automatically design Green networks that can provide appropriate recommendations with lower computational consumption in SR. Specifically, GreenSeq uses a novel multi-layer search space that allows for flexible network design and a Greenness-aware loss term for balancing efficiency and effectiveness. Experiments on benchmark datasets and A/B testing show that GreenSeq performs well while using fewer resources. GreenSeq also reduces electricity and carbon emissions in Alipay.
The cold-start problem of conversion rate prediction is a common challenge in online advertising systems. To alleviate this problem, a large number of methods either use content information or uncertainty methods, or use meta-learning based methods to improve the ranking performance of cold-start items. However, they can work for cold-start scenarios but fail to adaptively unify warm and cold recommendations into one model, requiring additional human efforts or knowledge to adapt to different scenarios. Meanwhile, none of them pay attention to the discrepancy between model predictions and true likelihoods of cold items, while over- or under-estimation is harmful to the ROI (Return on Investment) of advertising placements. In this paper, in order to address the above issues, we propose a framework called Distribution-Constrained Batch Transformer (DCBT). Specifically, the framework introduces a Transformer module into the batch dimension to automatically choose proper information from warm samples to enhance the representation of cold samples and preserve the property of warm samples. In addition, to avoid the distribution of cold samples being affected by the warm samples, the framework adds MMD loss to constrain the sample distribution before and after feeding into the Transformer module. Extensive offline experiments on two real-world datasets show that our proposed method attains state-of-the-art performance in AUC and PCOC (Predicted CVR over CVR) for cold items and warm items. An online A/B test demonstrates that the DCBT model obtained a 20.08% improvement in CVR and a 13.21% increase in GMV (Gross Merchandise Volume).
To retrieve personalized campaigns and creatives while protecting user privacy, digital advertising is shifting from member-based identity to cohort-based identity. Under such identity regime, an accurate and efficient cohort building algorithm is desired to group users with similar characteristics. In this paper, we propose a scalable K-anonymous cohort building algorithm called consecutive consistent weighted sampling (CCWS). The proposed method combines the spirit of the (p-powered) consistent weighted sampling (CWS) and hierarchical clustering, so that the K-anonymity is ensured by enforcing a lower bound on the size of cohorts. Evaluations on a LinkedIn dataset consisting of >70M users and ads campaigns demonstrate that CCWS achieves substantial improvements over several hashing-based methods including sign random projections (SignRP), minwise hashing (MinHash), as well as the vanilla CWS.
Query Parsing aims to extract product attributes, such as color, brand, and product type, from search queries. These attributes play a crucial role in search engines for tasks such as matching, ranking, and recommendation. There are two types of attributes: explicit attributes that are mentioned explicitly in the search query, and implicit attributes that are mentioned implicitly. Existing works on query parsing do not differentiate between explicit query parsing and implicit query parsing, which limits their performance in product search engines. In this work, we demonstrate the critical importance of implicit attributes in real-world product search engines. We then present our solution for implicit query parsing at Amazon Search, which is a unified framework combining recent advancements in knowledge graph technologies and customer behavior analysis. We demonstrate the effectiveness of our proposal through offline experiments on Amazon search log data. We also show how to deploy and use the framework on Amazon search to improve customers’ shopping experiences.
E-commerce search engines comprise a retrieval phase and a ranking phase, where the first one returns a candidate product set given user queries. Recently, vision-language pre-training, combining textual information with visual clues, has been popular in the application of retrieval tasks. In this paper, we propose a novel V+L pre-training method to solve the retrieval problem in Taobao Search. We design a visual pre-training task based on contrastive learning, outperforming common regression-based visual pre-training tasks. In addition, we adopt two negative sampling schemes, tailored for the large-scale retrieval task. Besides, we introduce the details of the online deployment of our proposed method in real-world situations. Extensive offline/online experiments demonstrate the superior performance of our method on the retrieval task. Our proposed method is employed as one retrieval channel of Taobao Search and serves hundreds of millions of users in real time.
Knowledge-intensive programming Q&A is an active research area in industry. Its application boosts developer productivity by aiding developers in quickly finding programming answers from the vast amount of information on the Internet. In this study, we propose ProQANS and its variants ReProQANS and ReAugProQANS to tackle programming Q&A. ProQANS is a neural search approach that leverages unlabeled data on the Internet (such as StackOverflow) to mitigate the cold-start problem. ReProQANS extends ProQANS by utilizing reformulated queries with a novel triplet loss. We further use an auxiliary generative model to augment the training queries, and design a novel dual triplet loss function to adapt these generated queries, to build another variant of ReProQANS termed as ReAugProQANS. In our empirical experiments, we show ReProQANS has the best performance when evaluated on the in-domain test set, while ReAugProQANS has the superior performance on the out-of-domain real programming questions, by outperforming the state-of-the-art model by up to 477% lift on the MRR metric respectively. The results suggest their robustness to previously unseen questions and its wide application to real programming questions.
Spellchecking is one of the most fundamental and widely used search features. Correcting incorrectly spelled user queries not only enhances the user experience but is expected by the user. However, most widely available spellchecking solutions are either lower accuracy than state-of-the-art solutions or too slow to be used for search use cases where latency is a key requirement. Furthermore, most innovative recent architectures focus on English and are not trained in a multilingual fashion and are trained for spell correction in longer text, which is a different paradigm from spell correction for user queries, where context is sparse (most queries are 1-2 words long). Finally, since most enterprises have unique vocabularies such as product names, off-the-shelf spelling solutions fall short of users’ needs.
In this work, we build a multilingual spellchecker that is extremely fast and scalable and that adapts its vocabulary and hence speller output based on a specific product’s needs. Furthermore our speller out-performs general purpose spellers by a wide margin on in-domain datasets. Our multilingual speller is used in search in Adobe products, powering autocomplete in various applications.
Lookalike models are based on the assumption that user similarity plays an important role towards product selling and enhancing the existing advertising campaigns from a very large user base. Challenges associated to these models reside on the heterogeneity of the user base and its sparsity. In this work, we propose a novel framework that unifies the customers’ different behaviors or features such as demographics, buying behaviors on different platforms, customer loyalty behaviors and build a lookalike model to improve customer targeting for Rakuten Group, Inc. Extensive experiments on real e-commerce and travel datasets demonstrate the effectiveness of our proposed lookalike model for user targeting task.
Semantic retrieval, which retrieves semantically matched items given a textual query, has been an essential component to enhance system effectiveness in e-commerce search. In this paper, we study the multimodal retrieval problem, where the visual information (e.g, image) of item is leveraged as supplementary of textual information to enrich item representation and further improve retrieval performance. Though learning from cross-modality data has been studied extensively in tasks such as visual question answering or media summarization, multimodal retrieval remains a non-trivial and unsolved problem especially in the asymmetric scenario where the query is unimodal while the item is multimodal. In this paper, we propose a novel model named SMAR, which stands for Semantic-enhanced Modality-Asymmetric Retrieval, to tackle the problem of modality fusion and alignment in this kind of asymmetric scenario. Extensive experimental results on an industrial dataset show that the proposed model outperforms baseline models significantly in retrieval accuracy. We have open sourced our industrial dataset for the sake of reproducibility and future research works.
Illegal live-streaming identification, which aims to help live-streaming platforms immediately recognize the illegal behaviors in the live-streaming, such as selling precious and endangered animals, plays a crucial role in purifying the network environment. Traditionally, the live-streaming platform needs to employ some professionals to manually identify the potential illegal live-streaming. Specifically, the professional needs to search for related evidence from a large-scale knowledge database for evaluating whether a given live-streaming clip contains illegal behavior, which is time-consuming and laborious. To address this issue, in this work, we propose a multimodal evidence retrieval system, named OFAR, to facilitate the illegal live-streaming identification. OFAR consists of three modules: Query Encoder, Document Encoder, and MaxSim-based Contrastive Late Intersection. Both query encoder and document encoder are implemented with the advanced OFA encoder, which is pretrained on a large-scale multimodal dataset. In the last module, we introduce contrastive learning on the basis of the MaxiSim-based late intersection, to enhance the model’s ability of query-document matching. The proposed framework achieves significant improvement on our industrial dataset TaoLive, demonstrating the advances of our scheme.
Online evaluation techniques are widely adopted by industrial search engines to determine which ranking models perform better under a certain business metric. However, online evaluation can only evaluate a small number of rankers and people resort to offline evaluation to select rankers that are likely to yield good online performance. To use offline metrics for effective model selection, a major challenge is to understand how well offline metrics predict which ranking models perform better in online experiments. This paper aims to address this challenge in product search ranking. Towards this end, we collect gold data in the form of preferences over ranker pairs under a business metric in e-commerce search engine. For the first time, we use such gold data to evaluate offline metrics in terms of directional agreement with the business metric. Furthermore, we analyze offline metrics in terms of discriminative power through paired sample t-test and rank correlations among offline metrics. Through extensive online and offline experiments, we studied 36 offline metrics and observed that: (1) Offline metrics align well with online metrics: they agree on which one of two ranking models is better up to 97% of times; (2) Offline metrics are highly discriminative on large-scale search ranking data, especially NDCG (Normalized Discounted Cumulative Gain) which has a discriminative power over 99%.
Finding the right product on e-commerce websites with millions of products is a daunting task for a large set of customers. On the search page, product attribute filters a.k.a. “refinements” emerge as a convenient navigational option for customers to narrow down the search results along product attributes of their choice (e.g., Material:Cotton, Color:Black for ‘shirt’). However, on mobile devices, refinements are not easily discoverable due to lack of screen space. To improve discoverability, contextually relevant refinements are suggested in-line on search page by refinement recommendation systems. Existing works on refinement recommendations primarily rely on the search context as input, and are trained using aggregated refinement preferences ‘explicitly’ expressed by customers. These solutions fail to capture ‘implicit’ preferences expressed during the customer shopping mission through in-session browsing activity. In this paper, we propose a session-based recommendation system (SBRS) which recommends refinements by inferring product attribute preferences of customers based on the sequence of products viewed earlier in the session. For the task of refinement recommendation, we propose a) AttriBERT, a model which extends BERT architecture to learn from the attribute values of products and b) a novel product representation strategy, which represents each product as a dictionary of attribute:value pairs (e.g., RAM Size:64GB). We evaluate our approach on RecSys 2022 Challenge and Amazon e-commerce datasets. Our approach consistently outperforms various state-of-the-art sequence models on the task of session-based refinement recommendation.
This full-day tutorial introduces modern techniques for practical uncertainty quantification specifically in the context of multi-class and multi-label text classification. First, we explain the usefulness of estimating aleatoric uncertainty and epistemic uncertainty for text classification models. Then, we describe several state-of-the-art approaches to uncertainty quantification and analyze their scalability to big text data: Virtual Ensemble in GBDT, Bayesian Deep Learning (including Deep Ensemble, Monte-Carlo Dropout, Bayes by Backprop, and their generalization Epistemic Neural Networks), Evidential Deep Learning (including Prior Networks and Posterior Networks), as well as Distance Awareness (including Spectral-normalized Neural Gaussian Process and Deep Deterministic Uncertainty). Next, we talk about the latest advances in uncertainty quantification for pre-trained language models (including asking language models to express their uncertainty, interpreting uncertainties of text classifiers built on large-scale language models, uncertainty estimation in text generation, calibration of language models, and calibration for in-context learning). After that, we discuss typical application scenarios of uncertainty quantification in text classification (including in-domain calibration, cross-domain robustness, and novel class detection). Finally, we list popular performance metrics for the evaluation of uncertainty quantification effectiveness in text classification. Practical hands-on examples/exercises are provided to the attendees for them to experiment with different uncertainty quantification methods on a few real-world text classification datasets such as CLINC150.
This half day tutorial introduces the participant to the basic concepts underlying neural Cross-Language Information Retrieval (CLIR). It discusses the most common algorithmic approaches to CLIR, focusing on modern neural methods; the history of CLIR; where to find and how to use CLIR training collections, test collections and baseline systems; how CLIR training and test collections are constructed; and open research questions in CLIR.
Data-driven recommender systems have demonstrated great success in various Web applications owing to the extraordinary ability of machine learning models to recognize patterns (ie correlation) from users’ behaviors. However, they still suffer from several issues such as biases and unfairness due to spurious correlations. Considering the causal mechanism behind data can avoid the influences of such spurious correlations. In this light, embracing causal recommender modeling is an exciting and promising direction.
In this tutorial, we aim to introduce the key concepts in causality and provide a systemic review of existing work on causal recommendation. We will introduce existing methods from two different causal frameworks — the potential outcome (PO) framework and the structural causal model (SCM). We will give examples and discussions regarding how to utilize different causal tools under these two frameworks to model and solve problems in recommendation. Moreover, we will summarize and compare the paradigms of PO-based and SCM-based recommendation. Besides, we identify some open challenges and potential future directions for this area. We hope this tutorial could stimulate more ideas on this topic and facilitate the development of causality-aware recommender systems.
This tutorial will provide an overview of recent advances on neuro-symbolic approaches for information retrieval. A decade ago, knowledge graphs and semantic annotations technology led to active research on how to best leverage symbolic knowledge. At the same time, neural methods have demonstrated to be versatile and highly effective.
From a neural network perspective, the same representation approach can service document ranking or knowledge graph reasoning. End-to-end training allows to optimize complex methods for downstream tasks.
We are at the point where both the symbolic and the neural research advances are coalescing into neuro-symbolic approaches. The underlying research questions are how to best combine symbolic and neural approaches, what kind of symbolic/neural approaches are most suitable for which use case, and how to best integrate both ideas to advance the state of the art in information retrieval.
Materials are available online: https://github.com/laura-dietz/neurosymbolic-representations-for-IR
Since its inception, the field of unbiased learning to rank (ULTR) has remained very active and has seen several impactful advancements in recent years. This tutorial provides both an introduction to the core concepts of the field and an overview of recent advancements in its foundations along with several applications of its methods.
The tutorial is divided into four parts: Firstly, we give an overview of the different forms of bias that can be addressed with ULTR methods. Secondly, we present a comprehensive discussion of the latest estimation techniques in the ULTR field. Thirdly, we survey published results of ULTR in real-world applications. Fourthly, we discuss the connection between ULTR and fairness in ranking. We end by briefly reflecting on the future of ULTR research and its applications.
This tutorial is intended to benefit both researchers and industry practitioners who are interested in developing new ULTR solutions or utilizing them in real-world applications.
In this tutorial, we aim to shed light on the task of recommending a set of multiple items at once. In this scenario, historical interaction data between users and items could also be in the form of a sequence of interactions with sets of items. Complex sets of items being recommended together occur in different and diverse domains, such as grocery shopping with so-called baskets and fashion set recommendation with a focus on outfits rather than individual clothing items. We describe the current landscape of research and expose our participants to real-world examples of item set recommendation. We further provide our audience with hands-on experience via a notebook session. Finally, we describe open challenges and call for further research in the area, which we hope will inspire both early stage and more experienced researchers.
This tutorial presents explainable information retrieval (ExIR), an emerging area focused on fostering responsible and trustworthy deployment of machine learning systems in the context of information retrieval. As the field has rapidly evolved in the past 4-5 years, numerous approaches have been proposed that focus on different access modes, stakeholders, and model development stages. This tutorial aims to introduce IR-centric notions, classification, and evaluation styles in ExIR, while focusing on IR-specific tasks such as ranking, text classification, and learning-to-rank systems. We will delve into method families and their adaptations to IR, extensively covering post-hoc methods, axiomatic and probing approaches, and recent advances in interpretability-by-design approaches. We will also discuss ExIR applications for different stakeholders, such as researchers, practitioners, and end-users, in contexts like web search, patent and legal search, and high-stakes decision-making tasks. To facilitate practical understanding, we will provide a hands-on session on applying ExIR methods, reducing the entry barrier for students, researchers, and practitioners alike.
ChatGPT and similar large language model (LLM) based conversational agents have brought shock waves to the research world. Although astonished by their human-like performance, we find they share a significant weakness with many other existing conversational agents in that they all take a passive approach in responding to user queries. This limits their capacity to understand the users and the task better and to offer recommendations based on a broader context than a given conversation. Proactiveness is still missing in these agents, including their ability to initiate a conversation, shift topics, or offer recommendations that take into account a more extensive context. To address this limitation, this tutorial reviews methods for equipping conversational agents with proactive interaction abilities.
The full-day tutorial is divided into four parts, including multiple interactive exercises. We will begin the tutorial with an interactive exercise and cover the design of existing conversational systems architecture and challenges. The content includes coverage of LLM-based recent advancements such as ChatGPT and Bard, along with reinforcement learning with human feedback (RLHF) technique. Then we will introduce the concept of proactive conversation agents and preset recent advancements in proactiveness of conversational agents, including actively driving conversations by asking questions, topic shifting, and methods that support strategic planning of conversation. Next, we will discuss important issues in conversational responses’ quality control, including safety, appropriateness, language detoxication, hallucination, and alignment. Lastly, we will launch another interactive exercise and discussion with the audience to arrive at concluding remarks, prospecting open challenges and new directions. By exploring new techniques for enhancing conversational agents’ proactive behavior to improve user engagement, this tutorial aims to help researchers and practitioners develop more effective conversational agents that can better understand and respond to user needs proactively and safely.
Multifaceted, empirical evaluation of algorithmic ideas is one of the central pillars of Information Retrieval (IR) research. The IR community has a rich history of studying the effectiveness of indexes, retrieval algorithms, and complex machine learning rankers and, at the same time, quantifying their computational costs, from creation and training to application and inference. As the community moves towards even more complex deep learning models, questions on efficiency have once again become relevant with renewed urgency. Indeed, efficiency is no longer limited to time and space; instead it has found new, challenging dimensions that stretch to resource-, sample- and energy-efficiency with ramifications for researchers, users, and the environment alike. Examining algorithms and models through the lens of holistic efficiency requires the establishment of standards and principles, from defining relevant concepts, to designing metrics, to creating guidelines for making sense of the significance of new findings. The second iteration of the ReNeuIR workshop aims to bring the community together to debate these questions, with the express purpose of moving towards a common benchmarking framework for efficiency.
Generative information retrieval (IR) has experienced substantial growth across multiple research communities (e.g., information retrieval, computer vision, natural language processing, and machine learning), and has been highly visible in the popular press. Theoretical, empirical, and actual user-facing products have been released that retrieve documents (via generation) or directly generate answers given an input request. We would like to investigate whether end-to-end generative models are just another trend or, as some claim, a paradigm change for IR. This necessitates new metrics, theoretical grounding, evaluation methods, task definitions, models, user interfaces, etc. The goal of this workshop1 is to focus on previously explored Generative IR techniques like document retrieval and direct Grounded Answer Generation, while also offering a venue for the discussion and exploration of how Generative IR can be applied to new domains like recommendation systems, summarization, etc. The format of the workshop is interactive, including roundtable and keynote sessions and tends to avoid the one-sided dialogue of a mini-conference.
Knowledge discovery from unstructured data, including business documents, web content, and news articles, has been a key AI challenge for the financial services industry. Comprehending these corpora and discovering knowledge from them, which could be textual, tabular, or graphic, are the cornerstone of supporting business decisions in the financial services domain, where information retrieval and content analysis techniques are of fundamental importance. We propose a workshop on knowledge discovery from unstructured data in financial services at SIGIR 2023 to highlight the current and emerging opportunities, invite original research, and prompt success sharing between researchers.
Most machine learning models are designed to be self-contained and encode both “knowledge” and “reasoning” in their parameters. However, such models cannot perform effectively for tasks that require knowledge grounding and tasks that deal with non-stationary data, such as news and social media. Besides, these models often require huge number of parameters to encode all the required knowledge. These issues can be addressed via augmentation with a retrieval model. This category of machine learning models, which is called Retrieval-enhanced machine learning (REML), has recently attracted considerable attention in multiple research communities. For instance, REML models have been studied in the context of open-domain question answering, fact verification, and dialogue systems and also in the context of generalization through memorization in language models and memory networks. We believe that the information retrieval community can significantly contribute to this growing research area by designing, implementing, analyzing, and evaluating various aspects of retrieval models with applications to REML tasks. The goal of this full-day hybrid workshop is to bring together researchers from industry and academia to discuss various aspects of retrieval-enhanced machine learning, including effectiveness, efficiency, and robustness of these models in addition to their impact on real-world applications.
A wide range of core information retrieval (IR) tasks, such as searching, ranking, and filtering, to name a few, have seen tremendous improvements thanks to machine learning (ML) and artificial intelligence (AI). The traditional centralized approach to training AI/ML models is still predominant: large volumes of data generated by end users must be transferred from their origins and shared with remote locations for processing. However, this centralized paradigm suffers from significant privacy issues and does not take full advantage of the computing power of client devices like modern smartphones. A possible answer to this need is provided by federated learning (FL), which enables collaborative training of predictive models among a set of cooperating edge devices without disclosing any private local data. Unfortunately, FL is still far from being fully exploited in the IR ecosystem.
In this workshop proposal, we have the ambition to start filling this gap. More specifically, the first workshop on ”Federated Learning for Information ReTrieval” (FLIRT) is willing to provide an open forum for researchers and practitioners where they can exchange ideas, identify key challenges, and define the roadmap toward a successful application of FL in the broad IR area.
eCommerce Information Retrieval (IR) is receiving increasing attention in the academic literature and is an essential component of some of the largest web sites (e.g. Airbnb, Alibaba, Amazon, eBay, Facebook, Flipkart, Lowes’s, Taobao, Target). SIGIR has for several years seen sponsorship from eCommerce organizations, reflecting the importance of IR research to them. The purpose of this workshop is (1) to bring together researchers and practitioners of eCommerce IR to discuss topics unique to it, (2) to determine how to use eCommerce’s unique combination of free text, structured data, and customer behavior data to improve search relevance, and (3) to examine how to build datasets and evaluate algorithms in this domain.
The theme of this year’s eCommerce IR workshop is Foundation Models and Unified Information Access in eCommerce. The workshop solicits papers on this topic and includes a panel focused on this area. In addition, Lowe’s is sponsoring an eCommerce data challenge on Cross-modal and Multi-modal Visual Search for eCommerce. The data challenge reflects themes from the successful SIGIR workshops in 2017, 2018, 2019, 2020, 2021, and 2022. ECOM23 will be held as a full day hybrid workshop to accommodate for diverse participation.
The 1st International Workshop on Implicit Author Characterization from Texts for Search and Retrieval (IACT’23)
The first edition of the Implicit Author Characterization from Texts for Search and Retrieval (IACT’23) aims at bringing to the forefront the challenges involved in identifying and extracting from texts implicit information about authors (e.g., human or AI) and using it in IR tasks. The IACT workshop provides a common forum to consolidate multi-disciplinary efforts and foster discussions to identify the wide-ranging issues related to the task of extracting implicit author-related information from the textual content, including novel tasks and datasets. We will also discuss the ethical implications of implicit information extraction. In addition, we announce a shared task focused on automatically determining the literary epochs of written books.
Information retrieval systems for the patent domain have a long history. They can support patent experts in a variety of daily tasks: from analyzing the patent landscape to support experts in the patenting process and large-scale information extraction. Advances in machine learning and natural language processing allow to further automate tasks, such as paragraph retrieval or even patent text generation. Uncovering the potential of semantic technologies for the intellectual property (IP) industry is just getting started. Investigating the use of artificial intelligence methods for the patent domain is therefore not only of academic interest, but also highly relevant for practitioners. Compared to other domains, high quality, semi-structured, annotated data is available in large volumes (a requirement for supervised machine learning models), making training large models easier. On the other hand, domain-specific challenges arise, such as very technical language or legal requirements for patent documents. The focus of the 4th edition of this workshop will be on two-way communication between industry and academia from all areas of information retrieval in particular with the Asian community. We want to bring together novel research results and the latest systems and methods employed by practitioners in the field.
SESSION: Doctoral Consortium
As information retrieval (IR) systems, such as search engines and conversational agents, become ubiquitous in various domains, the need for transparent and explainable systems grows to ensure accountability, fairness, and unbiased results. Despite many recent advances toward explainable AI and IR techniques, there is no consensus on what it means for a system to be explainable. Although a growing body of literature suggests that explainability is comprised of multiple subfactors [2, 5, 6], virtually all existing approaches treat it as a singular notion. Additionally, while neural retrieval models (NRMs) have become popular for their ability to achieve high performance[3, 4, 7, 8], research on the explainability of NRMs has been largely unexplored until recent years. Numerous questions remain unanswered regarding the most effective means of comprehending how these intricate models arrive at their decisions and the extent to which these methods will function efficiently for both developers and end-users.
This research aims to develop effective methods to evaluate and advance explainable retrieval systems toward the broader research field goal of creating techniques to make potential biases more identifiable. Specifically, I aim to investigate the following:
- RQ1: How do we quantitatively measure explainability?
- RQ2: How can we develop a set of inherently explainable NRMs using feature attributions that are robust across different retrieval domain contexts?
- RQ3: How can we leverage knowledge about influential training instances to better understand NRMs and promote more efficient search practices?
To address RQ1, we leverage psychometrics and crowdsourcing to introduce a multidimensional model of explainability for Web search systems. Our approach builds upon prior research on multidimensional relevance modeling  and supports the multidimensionality of explainability posited by recent literature. In doing so, we provide empirical evidence that these factors group between positive and negative facets that describe the utility and roadblocks to explainability of search systems. Additionally, we introduce a continuous-scale evaluation metric for explainable search systems which enables researchers to directly compare and evaluate the efficacy of their explanations.
In future work, I plan to address RQ2 and RQ3 by investigating two avenues of attribution methods, feature-based and instance-based, to develop a suite of explainable NRMs. While much work has been done on investigating the interpretability of deep neural network architectures in the general ML field, particularly in vision and language domains, creating inherently explainable neural architectures remains largely unexplored in IR. Thus, I intend to draw on previous work in the broader fields of NLP and ML to develop methods that offer deeper insights into the inner workings of NRMs and how ranking decisions are made.
By developing explainable IR systems, we can facilitate users’ comprehension of the intricate, non-linear mechanisms that link their search queries to highly ranked content. If applied correctly, this research has the potential to benefit society in a broad range of applications, such as disinformation detection and clinical decision support. Given their critical importance in modern society, these areas demand robust solutions to combat the escalating dissemination of false information. By enhancing the transparency and accountability of these systems, explainable systems can play a crucial role in curbing this trend.
Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction (MRE) are tasks in information retrieval that aim to recognize entities and extract relations among them using information from multiple modalities, such as text and images. Although current methods have attempted a variety of modality fusion approaches to enhance the information in text, a large amount of readily available internet retrieval data has not been considered. Therefore, we attempt to retrieve real-world text related to images, objects, and entire sentences from the internet and use this retrieved text as input for cross-modal fusion to improve the performance of entity and relation extraction tasks in the text.
Developing Information Retrieval (IR) applications such as search engines and recommendation systems require training of models that are growing in complexity and size with immense collections of data that contain multiple dimensions (documents/items text, user profiles, and interactions). Much of the research in IR concentrates on improving the performance of ranking models; however, given the high training time and high computational resources required to improve the performance by designing new models, it is crucial to address efficiency aspects of the design and deployment of IR applications at large-scale. In my thesis, I aim to improve the training efficiency of IR applications and speed up the development phase of new models, by applying dataset distillation approaches to reduce the dataset size while preserving the ranking quality and employing efficient High-Performance Computing (HPC) solutions to increase the processing speed.
The demands imposed on the user during the Information Retrieval (IR) process are well-recognised, with numerous studies highlighting the influence and importance of cost, effort, and load in explaining user search behaviour and experience. Despite this understanding, there exists no universal definitions of these constructs or standardised methods of measuring them within the field of IR. As a result, an open research problem has emerged in relation to how these constructs have been defined, interpreted, and subsequently measured. To address this research problem, the aim of this research is two-fold. Firstly, to establish working definitions and a conceptual framework for defining cost, effort, and load; and secondly, to examine and evaluate existing measures of effort and load within an IR context.
In almost every area of research, it is necessary to find experts and publications on a topic. However, finding experts and publications is a difficult task not only for computers, but also for humans. For example, searching for experts, a user often enters a topic into a search engine, which then checks which people have published on that topic. A problem arises when a user does not make their query specific enough which can happen intentionally, e.g. when the user is doing a navigational search, or unintentionally, e.g., when the user lacks knowledge. As a result, the quality of the search results may not be very high and the best results may not be found. Current and widely used search engines for bibliographic metadata, such as dblp, ResearchGate, Google Scholar or Semantic Scholar allow only keyword-based searches. Kreutz et al. presented SchenQL, a query language for bibliographic metadata that allows users to formulate their queries more easily and precisely than SQL. However, it requires training to understand the language and is not as easy for non-experts to use e.g. Google Scholar.
To address the limitations of insufficient attention to the user’s search intent and lack of search support, we aim to develop a conversational retrieval system in the domain of bibliographic metadata. This conversational search system assists users in achieving their search intent through a natural language dialog. It should be possible not only to find experts, but also to search for bibliographic metadata with the help of the system without prior knowledge.
In this work, we aim to answer the research question: How beneficial is a conversational information retrieval system for searching bibliographic data? To address the research question, our contribution is threefold. First, we present an architecture for such a conversational information retrieval system for bibliographic metadata. Second, we will implement all the components of this system and evaluate our system by comparing it to existing bibliographic data search engines in terms of effectiveness, efficiency, and user satisfaction. Third, we will create and publish a dataset consisting of user queries that we will use to train our system.
The architecture we propose consists of five main components: i) user intent classification, ii) a keyword extractor, iii) a search module, iv) a conversational module and v) the conversation history.
i) The task of user intent classification is to determine the goal the user wants to achieve with their search query. The user intent classification consists of a set of corresponding user intent classifiers, each of which is responsible for one intent. If no classifier can match the user’s query to their intent, or multiple classifiers conclude that the query matches their intents, the system asks the user to specify the query accordingly. ii) If the intent is correctly determined, a keyword extractor extracts the actual search term from the query. iii) After the intent and the search term have been determined, the actual search takes place in the search module. The user’s query could be reformulated into a SchenQL query (Kreutz et al. and sent to the database. iv) In the conversational module, the results are converted into a natural language response and the user is given suggestions for further queries related to the previous ones. v) The conversation history stores the user’s queries and the system’s responses, both to consider the entire session when determining intent and to improve the system’s components. For example, new question formulations could improve the accuracy of the user intent classifiers as well as reveal what new intents a user of such a conversational search system might have that have not yet been implemented.
In first experiments, we defined four user intents and already evaluated the user intent classification. The four user intents are: (1) searching for persons/authors/experts on a topic, (2) searching for publications by author name, (3) searching for publications on a topic and (4) searching for similar topics of a topic. Our classifiers achieved an accuracy of 0.998 in correctly determining user intent.
In the future, we not only want to evaluate the individual components of a conversational information retrieval system for bibliographic data, but also want to work out the advantages and disadvantages of the conversational information retrieval system for bibliographic data, in comparison to already existing systems that do not support the user in their search process via natural language conversations.
Modern search engines employ multi-stage ranking pipeline to balance retrieval efficiency and effectiveness for large collections. These pipelines retrieve an initial set of candidate documents from the large repository by some cost-effective retrieval model (such as BM25, LM), then re-rank these candidate documents by neural retrieval models. These pipelines perform well if the first-stage ranker achieves high recall . To achieve this, the first-stage ranker should address the problems in milliseconds.
One of the major problems of the search engine is the presence of extraneous terms in the query. Since the query document term matching is the fundamental block of any retrieval model, the retrieval effectiveness drops when the documents are getting matched with these extraneous query terms. The existing models [4, 5] address this issue by estimating weights of the terms either by using supervised approaches or by utilizing the information of a set of initial top-ranked documents and incorporating it into the final ranking function. Although the later category of methods is unsupervised, they are inefficient as ranking the large collection to get the initial top-ranked documents is computationally expensive.
Besides, in the real-world collection, some terms may appear multiple times in the documents for several reasons, such as a term may appear for different contexts, the author bursts this term, or it is an outlier. Thus, the existing retrieval models overestimate the relevance score of the irrelevant documents if they contain some query term with extremely high frequency. Paik et al.  propose a probabilistic model based on truncated distributions that reduce the contribution of such high-frequency occurrences of the terms in relevance score. But, the truncation point selection does not leverage term-specific distribution information. It treats all the relevant documents as a bag for a set of queries which is not a good way to capture the distribution of terms. Furthermore, this model does not capture the term burstiness; it only reduces the effect of the outliers. Cummins et al.  propose a language model based on Dirichlet compound multinomial distribution that can capture the term burstiness. But this model is explicitly specific to the language model.
Considering the above research gaps, we focus on the following research questions in this doctoral work.
Research Question 1: How can we identify the central query terms from the verbose query without relying on an initial ranked list or relevance judgment and modify the ranking function so that it can focus on the derived central query terms? To address RQ1, we generate the contextual vector of the entire query and individual query terms using the pre-trained BERT (Bidi-rectional Encoder Representations from Transformers) model and subsequently analyze their correlation to estimate the term centrality score so that the ranking function may focus on the central terms while term matching.
Research Question 2: How can we identify the outlier terms of the large collection and penalize them in the ranking function? For RQ2, we model the distribution of maximum normalized term frequency values of relevant documents for the terms of a set of queries. Then we estimate the probability that the normalized frequency of a new term is coming from the right extreme of that distribution and uses this probability to penalize them in the ranking function.
Research Question 3: How can we detect the bursty terms and incorporate them in the ranking function?
To address RQ3, we propose a model that estimates the burstiness score of a term from its information content in a document and use this score to penalize the bursty term in the ranking function. To estimate the information content of a term, we capture the contextual information of each occurrence of a term by utilizing the pre-trained BERT model and estimate the contextual divergence of the occurrence of a term from its previous occurrences.
With the development of new neural network architectures for graph learning in recent years, the use of graphs to store, represent and process data become more trendy. Nowadays, graphs as a rich structured form of data representation are used in many real-world projects. In these projects, objects are often defined in terms of their connections to other things. There are many practical applications in areas such as antibacterial discovery, physics simulations, fake news detection, traffic prediction and recommendation systems.
While there are a growing number of neural graph representation learning techniques that allow one to learn effective graph representations [1-3, 5], they may not necessarily be appropriate for the task of searching over graphs for several reasons: (1) nodes within a graph consist of a set of attributes, which are the subject of the search process. However, the number of unique attributes on the graph compared to the number of attributes on each node is extremely sparse; therefore, this makes it very difficult to learn effective graph representations for such sparse information; (2) graph neural networks are capable of generating rich embedding representations for nodes and entities in graph however, depending on the downstream task, the embedding vectors may perform not well as expected. Therefore, researchers need to apply solutions such as custom loss functions or pre-training tasks to adapt mentioned architectures for their specific task. Search in the graph as a downstream task follows the same trend and therefore a tailored graph neural network representation is needed to particularly address the need for a rich embedding representation for the sole purpose of subgraph search.
My work focuses on searching for subgraph structures over both complete and incomplete heterogeneous graphs. The significance of my research direction lies in the fact that exact subgraph search is an NP-hard problem and as such existing methods are either accurate but impractically slow, or efficient yet suffering from low effectiveness. With a focus on learning robust neural representations for complete and incomplete graphs, my research focuses on developing search methods that are both effective and efficient. Specifically, my research addresses the following research questions: RQ1) Would it be possible to design and develop graph representation learning methods for heterogeneous graphs that can generate effective embedding vectors from a heterogeneous graph and support effective and efficient subgraph search? RQ2) Whether it would be possible to address the issue of graphs with varying degrees of missing values. Incomplete graphs suffer from missing attributes and/or missing edges. I explore the design of robust graph representation learning models capable of effectively searching in light of missing information. RQ3) Can efficient and effective derivations of my subgraph search methods be used to address practical applications in real-world domains such as team formation, and keyword search over knowledge graphs?
Research conducted as a part of my Ph.D. has so far focused on identifying and retrieving subgraphs over complete heterogeneous graphs. I have used the Team Formation problem as a case study in order to evaluate my work . As the next step, I am planning to focus on expanding my research to include incomplete graphs and investigate methodologies to craft optimized graph representations for the specific task of searching on graphs.
The dual-encoder model is a dense retrieval architecture, consisting of two encoder models, that has surpassed traditional sparse retrieval methods for open-domain retrieval . But, room exists for improvement, particularly when dense retrievers are exposed to unseen passages or queries.
Considering out-of-domain queries, i.e., queries originating from domains other than the one the model was trained on, the loss in accuracy may be significant. A main factor for this is the mismatch in the information available to the context encoder and the query encoder during training. Common retireval training datasets contain an overwhelming majority of passages with one query from a passage. I hypothesize that this could lead the dual-encoder model, particularly the passage encoder, to overfit to a single potential query from a given passage to the detriment of out-of-domain performance. Based on this, I seek to answer the following research question: (RQ1.1) Does training a DPR model on data containing multiple queries per passage improve the generalizability of the model? To answer RQ1.1, I build generated datasets that have multiple queries for most passages, and compare dense passage retriever models trained on these datasets against models trained on (mostly) single query per passage datasets. I show that training on passages with multiple queries leads to models that generalize better to out-of-distribution and out-of-domain test datasets .
Language can be considered another domain in the context of a dense retrieval. Training a dense retrieval model is especially challenging in languages other than English due to the scarcity of training data. I propose a novel training technique, clustered training, aimed at improving the retrieval quality of dense retrievers, especially in out-of-distribution and zero-shot settings. I address the following research questions: (RQ2.1)Does clustered training improve the effectiveness of multilingual DPR models on in-distribution data? (RQ2.2) Does clustered training improve the effectiveness of multilingual DPR models on out-of-distribution data from languages that it is trained on? (RQ2.2 Does clustered training improve the effectiveness of multilingual DPR models on out-of-distribution data from languages that it is trained on? (RQ2.3) Does clustered training help multilingual DPR models to generalize to new languages (zero-shot)? I show that clustered training improves the out-of-distribution and zero-shot performance of a DPR model without a clear loss in in-distribution performance using the Mr. TyDi  dataset.
Finally, I propose a modified dual-encoder architecture that can perform both retrieval and reranking with the same model in a single forward pass. While dual encoder models can surpass traditional sparse retrieval methods, they lag behind two stage retrieval pipelines in retrieval quality. I propose a modification to the dual encoder model where a second representation is used to rerank the passages retrieved using the first representation. Here, a second stage model is not required and both representations are generated in a single forward pass from the dual encoder. I aim to answer the following research questions in this work: (RQ3.1), Can the same model be trained to effectively generate two representations intended for two uses? RQ3.2 Can the retrieval quality of the model be improved by simultaneously performing retrieval and reranking? (RQ3.3 What is the tradeoff between retrieval quality vs. latency and compute resource efficiency for the proposed method vs. a two stage retriever? I expect that my proposed architecture would improve the dual encoder retrieval quality without sacrificing throughput or needing more computational resources.
Evaluation is one of the major concerns when developing information retrieval systems. Especially in the field of conversational AI, this topic has been heavily studied in the setting of both non-task and task-oriented conversational agents (dialogue systems). Recently, several automatic metrics e.g., BLEU and ROUGE, proposed for the evaluation of dialogue systems, have shown poor correlation with human judgment and are thus ineffective for the evaluation of dialogue systems. As a consequence, a significant amount of research relies on human evaluation to estimate the effectiveness of dialogue systems[1, 4}.
An emerging approach for evaluating task-oriented dialogue systems (TDS) is to estimate a user’s overall satisfaction with the system from explicit and implicit user interaction signals [2, 3]. Though useful and effective, overall user satisfaction does not necessarily give insights into what aspects or dimensions a TDS is performing well on. Understanding why a user is satisfied or dissatisfied helps the TDS recover from an error and optimize towards an individual aspect to avoid total dissatisfaction during an interaction session.
Understanding a user’s satisfaction with TDS is crucial, mainly for two reasons. First, it allows system designers to understand different user perceptions regarding satisfaction, which in turn leads to better user personalization. Secondly, it can be used to avoid total dialogue failure by the system by deploying adaptive conversational approaches, such as failure recovery or switching topics. And, thus, fine-grained evaluation of TDS gives the system an opportunity to learn an individual user’s interaction preferences leading to a fulfilled user goal. Therefore in this research, we take the first initiative toward understanding user satisfaction with TDS. We mainly focus on the fine-grained evaluation of conversational systems in a task-oriented setting.
Exploring User and Item Representation, Justification Generation, and Data Augmentation for Conversational Recommender Systems
Conversational Recommender Systems (CRS) aim to provide personalized and contextualized recommendations through natural language conversations with users. The objective of my proposed dissertation is to capitalize on the recent developments in conversational interfaces to advance the field of Recommender Systems in several directions. I aim to address several problems in recommender systems: user and item representation, justification generation, and data sparsity.
A critical challenge in CRS is learning effective representations of users and items that capture their preferences and characteristics. First, we focus on user representation, where we use a separate corpus of reviews to learn user representation. We attempt to map conversational users into the space of reviewers using semantic similarity between the conversation and the texts of reviews. Second, we improve item representation by incorporating textual features such as item descriptions into the user-item interaction graph, which captures a great deal of semantic and behavioral information unavailable from the purely topological structure of the interaction graph.
Justifications for recommendations enhance the explainability and transparency of CRS; however, existing approaches, such as rule-based and template-based methods, have limitations. In this work, we propose an extractive method using a corpus of reviews to identify relevant information for generating concise and coherent justifications.
We address the challenge of data scarcity for CRS by generating synthetic conversations using SOTA generative pre trained transformers (GPT). These synthetic conversations are used to augment the data used for training the CRS.
In addition, we also evaluate if the GPTs exhibit emerging abilities of CRS (or a non-conversational RecSys) due to the large amount of data they are trained on, which potentially includes the reviews and opinions of users.
Recommender systems (RecSys) become increasingly prevalent in modern society, offering personalized information filtering to alleviate information overload and significantly impacting various human online activities. Machine learning-based recommendation methods have been extensively developed in recent years to achieve more accurate recommendations, with some of these approaches having been extensively deployed in industrial applications, such as the Deep Interest Network (DIN). Despite their widespread use, researchers and practitioners have highlighted various trustworthiness issues inherent in these systems, including bias and promoting polarization issues. In order to better serve users and comply with regulations pertaining to recommendation algorithms established by different countries, it is essential to consider the trustworthiness issues of recommender systems.
This research focuses on trustworthiness in recommendation from two perspectives of user-centered principles: faithfulness and responsibility. On the one hand, collected recommendation data may not faithfully reflect user preferences, especially those of the service stage, due to bias[2, 3] and temporal effects,[4,5]etc. Achieving faithful recommendations with such data is crucial to ensure user satisfaction, i.e., making recommendations faithfully reflect user preferences during the testing. On the other hand, recommender systems could not only cater to user preferences  but also unconsciously and unintentionally affect (or even manipulate) user preferences. In the recommendation process, controlling the influence of recommender systems, such as avoiding potential opinion polarization, to provide responsible recommendations is also an important aspect of building trustworthy recommender systems. Consequently, there raise four research questions on the two aspects:
- RQ1: How can we model genuine user preferences when training data fails to faithfully reflect the user’s current preferences?
- RQ2: How can we ensure that recommender models faithfully match the user’s future preferences?
- RQ3: How can we quantify and evaluate the impact of a recommender system on user preferences?
- RQ4: How can we control the impact of a recommender system on user preferences to avoid negative side effects?
Our objective is to achieve faithful and responsible recommendations for users while addressing these research questions. We attribute unfaithful recommendation to the discrepancies between the training data and the service objectives, which we formulate as different data shift problems (RQ1 and RQ2). We provide systematic analyses for these data shift problems from causal perspectives and develop several causality-inspired solutions to enhance recommendation faithfulness. In pursuit of responsible recommendations, we investigate the effect of recommender systems on users from a causal perspective. We develop a causal effect evaluation and adjustment framework to quantify and control the influence of recommender systems on user preferences (RQ3 and RQ4).