SIGIR '18- The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval

Full Citation in the ACM Digital Library

SESSION: Keynotes

Salton Award Keynote: Information Interaction in Context

Data Science for Social Good and Public Policy: Examples, Opportunities, and Challenges

Can data science help reduce police violence and misconduct? Can it help increase retention of patients in care? Can it help prevent children from getting lead poisoning? Can it help cities better target limited resources to improve lives of citizens? We're all aware of the hype around data science and related buzzwords right now but turning this hype into social impact takes cross-disciplinary training, teams, and methods. In this talk, I'll discuss lessons learned from our work at University of Chicago while working on dozens of data science projects over the past few years with non-profits and governments on high-impact public policy and social challenges in criminal justice, public health, education, economic development, public safety, workforce training, and urban infrastructure. I'll highlight opportunities for IR researchers to get involved in these efforts as well as information retrieval challenges that are open research problems that need to be solved in order to increase the effectiveness of today's machine learning and data science algorithms in order to have social and policy impact in a fair and equitable manner.

SESSION: Session 1A: New IR Applications

Neural Compatibility Modeling with Attentive Knowledge Distillation

Recently, the booming fashion sector and its huge potential benefits have attracted tremendous attention from many research communities. In particular, increasing research efforts have been dedicated to the complementary clothing matching as matching clothes to make a suitable outfit has become a daily headache for many people, especially those who do not have the sense of aesthetics. Thanks to the remarkable success of neural networks in various applications such as the image classification and speech recognition, the researchers are enabled to adopt the data-driven learning methods to analyze fashion items. Nevertheless, existing studies overlook the rich valuable knowledge (rules) accumulated in fashion domain, especially the rules regarding clothing matching. Towards this end, in this work, we shed light on the complementary clothing matching by integrating the advanced deep neural networks and the rich fashion domain knowledge. Considering that the rules can be fuzzy and different rules may have different confidence levels to different samples, we present a neural compatibility modeling scheme with attentive knowledge distillation based on the teacher-student network scheme. Extensive experiments on the real-world dataset show the superiority of our model over several state-of-the-art methods. Based upon the comparisons, we observe certain fashion insights that can add value to the fashion matching study. As a byproduct, we released the codes, and involved parameters to benefit other researchers.

Attentive Moment Retrieval in Videos

In the past few years, language-based video retrieval has attracted a lot of attention. However, as a natural extension, localizing the specific video moments within a video given a description query is seldom explored. Although these two tasks look similar, the latter is more challenging due to two main reasons: 1) The former task only needs to judge whether the query occurs in a video and returns an entire video, but the latter is expected to judge which moment within a video matches the query and accurately returns the start and end points of the moment. Due to the fact that different moments in a video have varying durations and diverse spatial-temporal characteristics, uncovering the underlying moments is highly challenging. 2) As for the key component of relevance estimation, the former usually embeds a video and the query into a common space to compute the relevance score. However, the later task concerns moment localization where not only the features of a specific moment matter, but the context information of the moment also contributes a lot. For example, the query may contain temporal constraint words, such as "first'', therefore need temporal context to properly comprehend them. To address these issues, we develop an Attentive Cross-Modal Retrieval Network. In particular, we design a memory attention mechanism to emphasize the visual features mentioned in the query and simultaneously incorporate their context. In the light of this, we obtain the augmented moment representation. Meanwhile, a cross-modal fusion sub-network learns both the intra-modality and inter-modality dynamics, which can enhance the learning of moment-query representation. We evaluate our method on two datasets: DiDeMo and TACoS. Extensive experiments show the effectiveness of our model as compared to the state-of-the-art methods.

Enhancing Person-Job Fit for Talent Recruitment: An Ability-aware Neural Network Approach

The wide spread use of online recruitment services has led to information explosion in the job market. As a result, the recruiters have to seek the intelligent ways for Person-Job Fit, which is the bridge for adapting the right job seekers to the right positions. Existing studies on Person-Job Fit have a focus on measuring the matching degree between the talent qualification and the job requirements mainly based on the manual inspection of human resource experts despite of the subjective, incomplete, and inefficient nature of the human judgement. To this end, in this paper, we propose a novel end-to-end A bility-aware P erson-J ob F it N eural N etwork (APJFNN) model, which has a goal of reducing the dependence on manual labour and can provide better interpretation about the fitting results. The key idea is to exploit the rich information available at abundant historical job application data. Specifically, we propose a word-level semantic representation for both job requirements and job seekers' experiences based on Recurrent Neural Network (RNN). Along this line, four hierarchical ability-aware attention strategies are designed to measure the different importance of job requirements for semantic representation, as well as measuring the different contribution of each job experience to a specific ability requirement. Finally, extensive experiments on a large-scale real-world data set clearly validate the effectiveness and interpretability of the APJFNN framework compared with several baselines.

Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings

Designing powerful tools that support cooking activities has rapidly gained popularity due to the massive amounts of available data, as well as recent advances in machine learning that are capable of analyzing them. In this paper, we propose a cross-modal retrieval model aligning visual and textual data (like pictures of dishes and their recipes) in a shared representation space. We describe an effective learning scheme, capable of tackling large-scale problems, and validate it on the Recipe1M dataset containing nearly 1 million picture-recipe pairs. We show the effectiveness of our approach regarding previous state-of-the-art models and present qualitative results over computational cooking use cases.

SESSION: Session 1B: Log Analysis

A Click Sequence Model for Web Search

Getting a better understanding of user behavior is important for advancing information retrieval systems. Existing work focuses on modeling and predicting single interaction events, such as clicks. In this paper, we for the first time focus on modeling and predicting sequences of interaction events. And in particular, sequences of clicks. We formulate the problem of click sequence prediction and propose a click sequence model (CSM) that aims to predict the order in which a user will interact with search engine results. CSM is based on a neural network that follows the encoder-decoder architecture. The encoder computes contextual embeddings of the results. The decoder predicts the sequence of positions of the clicked results. It uses an attention mechanism to extract necessary information about the results at each timestep. We optimize the parameters of CSM by maximizing the likelihood of observed click sequences. We test the effectiveness of CSM on three new tasks: (i) predicting click sequences, (ii) predicting the number of clicks, and (iii) predicting whether or not a user will interact with the results in the order these results are presented on a search engine result page (SERP). Also, we show that CSM achieves state-of-the-art results on a standard click prediction task, where the goal is to predict an unordered set of results a user will click on.

Understanding and Evaluating User Satisfaction with Music Discovery

We study the use and evaluation of a system for supporting music discovery, the experience of finding and listening to content previously unknown to the user. We adopt a mixed methods approach, including interviews, unsupervised learning, survey research, and statistical modeling, to understand and evaluate user satisfaction in the context of discovery. User interviews and survey data show that users' behaviors change according to their goals, such as listening to recommended tracks in the moment, or using recommendations as a starting point for exploration. We use these findings to develop a statistical model of user satisfaction at scale from interactions with a music streaming platform. We show that capturing users' goals, their deviations from their usual behavior, and their peak interactions on individual tracks are informative for estimating user satisfaction. Finally, we present and validate heuristic metrics that are grounded in user experience for online evaluation of recommendation performance. Our findings, supported with evidence from both qualitative and quantitative studies, reveal new insights about user expectations with discovery and their behavioral responses to satisfying and dissatisfying systems.

Identifying Users behind Shared Accounts in Online Streaming Services

Online streaming services are prevalent. Major service providers, such as Netflix (for movies) and Spotify (for music), usually have a large customer base. More often than not, users may share an account. This has attracted increasing attention recently, as account sharing not only compromises the service provider's financial interests but also impairs the performance of recommendation systems and consequently the quality of service provided to the users. To address this issue, this paper focuses on the problem of user identification in shared accounts. Our goal is three-fold: (1) Given an account, along with its historical session logs, we identify a set of users who share such account; (2) Given a new session issued by an account, we find the corresponding user among the identified users of such account; (3) We aim to boost the performance of item recommendation by user identification. While the mapping between users and accounts is unknown, we propose an unsupervised learning-based framework, Session-based Heterogeneous graph Embedding for User Identification (SHE-UI), to differentiate and model the preferences of users in an account, and to group sessions by these users. In SHE-UI, a heterogeneous graph is constructed to represent items such as songs and their available metadata such as artists, genres, and albums. An item-based session embedding technique is proposed using a normalized random walk in the heterogeneous graph. Our experiments conducted on two large-scale music streaming datasets, Last.fm and KKBOX, show that SHE-UI not only accurately identifies users, but also significantly improves the performance of item recommendation over the state-of-the-art methods.

Predicting User Knowledge Gain in Informational Search Sessions

Web search is frequently used by people to acquire new knowledge and to satisfy learning-related objectives. In this context, informational search missions with an intention to obtain knowledge pertaining to a topic are prominent. The importance of learning as an outcome of web search has been recognized. Yet, there is a lack of understanding of the impact of web search on a user's knowledge state. Predicting the knowledge gain of users can be an important step forward if web search engines that are currently optimized for relevance can be molded to serve learning outcomes. In this paper, we introduce a supervised model to predict a user's knowledge state and knowledge gain from features captured during the search sessions. To measure and predict the knowledge gain of users in informational search sessions, we recruited 468 distinct users using crowdsourcing and orchestrated real-world search sessions spanning 11 different topics and information needs. By using scientifically formulated knowledge tests, we calibrated the knowledge of users before and after their search sessions, quantifying their knowledge gain. Our supervised models utilise and derive a comprehensive set of features from the current state of the art and compare performance of a range of feature sets and feature selection strategies. Through our results, we demonstrate the ability to predict and classify the knowledge state and gain using features obtained during search sessions, exhibiting superior performance to an existing baseline in the knowledge state prediction task.

SESSION: Session 1C: Prediction

Dynamic Shard Cutoff Prediction for Selective Search

Selective search architectures use resource selection algorithms such as Rank-S or Taily to rank index shards and determine how many to search for a given query. Most prior research evaluated solutions by their ability to improve efficiency without significantly reducing early-precision metrics such as [email protected] and [email protected] This paper recasts selective search as an early stage of a multi-stage retrieval architecture, which makes recall-oriented metrics more appropriate. A new algorithm is presented that predicts the number of shards that must be searched for a given query in order to meet recall-oriented goals. Decoupling shard ranking from deciding how many shards to search clarifies efficiency vs. effectiveness trade-offs, and enables them to be optimized independently. Experiments on two corpora demonstrate the value of this approach.

Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks

Multivariate time series forecasting is an important machine learning problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation. Temporal data arise in these real-world applications often involves a mixture of long-term and short-term patterns, for which traditional approaches such as Autoregressive models and Gaussian Process may fail. In this paper, we proposed a novel deep learning framework, namely Long- and Short-term Time-series network (LSTNet), to address this open challenge. LSTNet uses the Convolution Neural Network (CNN) and the Recurrent Neural Network (RNN) to extract short-term local dependency patterns among variables and to discover long-term patterns for time series trends. Furthermore, we leverage traditional autoregressive model to tackle the scale insensitive problem of the neural network model. In our evaluation on real-world data with complex mixtures of repetitive patterns, LSTNet achieved significant performance improvements over that of several state-of-the-art baseline methods. All the data and experiment codes are available online.

Neural Query Performance Prediction using Weak Supervision from Multiple Signals

Predicting the performance of a search engine for a given query is a fundamental and challenging task in information retrieval. Accurate performance predictors can be used in various ways, such as triggering an action, choosing the most effective ranking function per query, or selecting the best variant from multiple query formulations. In this paper, we propose a general end-to-end query performance prediction framework based on neural networks, called NeuralQPP. Our framework consists of multiple components, each learning a representation suitable for performance prediction. These representations are then aggregated and fed into a prediction sub-network. We train our models with multiple weak supervision signals, which is an unsupervised learning approach that uses the existing unsupervised performance predictors using weak labels. We also propose a simple yet effective component dropout technique to regularize our model. Our experiments on four newswire and web collections demonstrate that NeuralQPP significantly outperforms state-of-the-art baselines, in nearly every case. Furthermore, we thoroughly analyze the effectiveness of each component, each weak supervision signal, and all resulting combinations in our experiments.

Combined Regression and Tripletwise Learning for Conversion Rate Prediction in Real-Time Bidding Advertising

In real-time bidding advertising (RTB), the buyers bid for individual advertisement impressions provided by publishers in real time. The final goal of the buyers is to maximize the return on their investment. To gain higher returns, buyers prefer to first purchase more conversion impressions than click-only ones and then purchase more click-only impressions prior to non-click ones. Simultaneously, to reduce the expense, they need to accurately estimate a reasonable bid price, the predicted precision of which depends on the precision of the predicted conversion rate (CVR) or predicted click-through rate (CTR). Therefore, the predicted CVR or predicted CTR must provide not only good ranking values but also correct regression estimations. This paper is focused on the CVR estimation problem for buy-sides in RTB and a combined regression and tripletwise ranking method (CRT) is proposed that jointly considers regression loss and tripletwise ranking loss to estimate the CVR. This method attempts to rank conversion impressions above click-only ones and simultaneously rank click-only impressions above non-click ones. Meanwhile, through simultaneously utilizing the historical conversion and click information to alleviate sparsity, the CRT method is also aimed to achieve a good two category-ranking performance, as well as a good regression performance for predicting the CVR.

SESSION: Session 1D: Learning to Rank I

From Greedy Selection to Exploratory Decision-Making: Diverse Ranking with Policy-Value Networks

The goal of search result diversification is to select a subset of documents from the candidate set to satisfy as many different subtopics as possible. In general, it is a problem of subset selection and selecting an optimal subset of documents is NP-hard. Existing methods usually formalize the problem as ranking the documents with greedy sequential document selection. At each of the ranking position the document that can provide the largest amount of additional information is selected. It is obvious that the greedy selections inevitably produce suboptimal rankings. In this paper we propose to partially alleviate the problem with a Monte Carlo tree search (MCTS) enhanced Markov decision process (MDP), referred to as M$^2$Div. In M$^2$Div, the construction of diverse ranking is formalized as an MDP process where each action corresponds to selecting a document for one ranking position. Given an MDP state which consists of the query, selected documents, and candidates, a recurrent neural network is utilized to produce the policy function for guiding the document selection and the value function for predicting the whole ranking quality. The produced raw policy and value are then strengthened with MCTS through exploring the possible rankings at the subsequent positions, achieving a better search policy for decision-making. Experimental results based on the TREC benchmarks showed that M$^2$Div can significantly outperform the state-of-the-art baselines based on greedy sequential document selection, indicating the effectiveness of the exploratory decision-making mechanism in M$^2$Div.

Learning a Deep Listwise Context Model for Ranking Refinement

Learning to rank has been intensively studied and widely applied in information retrieval. Typically, a global ranking function is learned from a set of labeled data, which can achieve good performance on average but may be suboptimal for individual queries by ignoring the fact that relevant documents for different queries may have different distributions in the feature space. Inspired by the idea of pseudo relevance feedback where top ranked documents, which we refer as the local ranking context, can provide important information about the query's characteristics, we propose to use the inherent feature distributions of the top results to learn a Deep Listwise Context Model that helps us fine tune the initial ranked list. Specifically, we employ a recurrent neural network to sequentially encode the top results using their feature vectors, learn a local context model and use it to re-rank the top results. There are three merits with our model: (1) Our model can capture the local ranking context based on the complex interactions between top results using a deep neural network; (2) Our model can be built upon existing learning-to-rank methods by directly using their extracted feature vectors; (3) Our model is trained with an attention-based loss function, which is more effective and efficient than many existing listwise methods. Experimental results show that the proposed model can significantly improve the state-of-the-art learning to rank methods on benchmark retrieval corpora.

Efficient Exploration of Gradient Space for Online Learning to Rank

Online learning to rank (OL2R) optimizes the utility of returned search results based on implicit feedback gathered directly from users. In this paper, we accelerate the online learning process by efficient exploration in the gradient space. Our algorithm, named as Null Space Gradient Descent, reduces the exploration space to only the null space of recent poorly performing gradients. This prevents the algorithm from repeatedly exploring directions that have been discouraged by the most recent interactions with users. To improve sensitivity of the resulting interleaved test, we selectively construct candidate rankers to maximize the chance that they can be differentiated by candidate ranking documents in the current query; and we use historically difficult queries to identify the best ranker when tie occurs in comparing the rankers. Extensive experimental comparisons with the state-of-the-art OL2R algorithms on several public benchmarks confirmed the effectiveness of our proposal algorithm, especially in its fast learning convergence and promising ranking quality at an early stage.

Selective Gradient Boosting for Effective Learning to Rank

Learning an effective ranking function from a large number of query-document examples is a challenging task. Indeed, training sets where queries are associated with a few relevant documents and a large number of irrelevant ones are required to model real scenarios of Web search production systems, where a query can possibly retrieve thousands of matching documents, but only a few of them are actually relevant. In this paper, we propose Selective Gradient Boosting (SelGB), an algorithm addressing the Learning-to-Rank task by focusing on those irrelevant documents that are most likely to be mis-ranked, thus severely hindering the quality of the learned model. SelGB exploits a novel technique minimizing the mis-ranking risk, i.e., the probability that two randomly drawn instances are ranked incorrectly, within a gradient boosting process that iteratively generates an additive ensemble of decision trees. Specifically, at every iteration and on a per query basis, SelGB selectively chooses among the training instances a small sample of negative examples enhancing the discriminative power of the learned model. Reproducible and comprehensive experiments conducted on a publicly available dataset show that SelGB exploits the diversity and variety of the negative examples selected to train tree ensembles that outperform models generated by state-of-the-art algorithms by achieving improvements of [email protected] up to 3.2%.

SESSION: Session 2A: Sentiment & Opinion

Explainable Recommendation via Multi-Task Learning in Opinionated Text Data

Explaining automatically generated recommendations allows users to make more informed and accurate decisions about which results to utilize, and therefore improves their satisfaction. In this work, we develop a multi-task learning solution for explainable recommendation. Two companion learning tasks of user preference modeling for recommendation and opinionated content modeling for explanation are integrated via a joint tensor factorization. As a result, the algorithm predicts not only a user's preference over a list of items, i.e., recommendation, but also how the user would appreciate a particular item at the feature level, i.e., opinionated textual explanation. Extensive experiments on two large collections of Amazon and Yelp reviews confirmed the effectiveness of our solution in both recommendation and explanation tasks, compared with several existing recommendation algorithms. And our extensive user study clearly demonstrates the practical value of the explainable recommendations generated by our algorithm.

Sentiment Analysis of Peer Review Texts for Scholarly Papers

Sentiment analysis has been widely explored in many text domains, including product reviews, movie reviews, tweets, and so on. However, there are very few studies trying to perform sentiment analysis in the domain of peer reviews for scholarly papers, which are usually long and introducing both pros and cons of a paper submission. In this paper, we for the first time investigate the task of automatically predicting the overall recommendation/decision (accept, reject, or sometimes borderline) and further identifying the sentences with positive and negative sentiment polarities from a peer review text written by a reviewer for a paper submission. We propose a multiple instance learning network with a novel abstract-based memory mechanism (MILAM) to address this challenging task. Two evaluation datasets are constructed from the ICLR open reviews and evaluation results verified the efficacy of our proposed model. Our model much outperforms a few existing models in different experimental settings. We also find the generally good consistency between the review texts and the recommended decisions, except for the borderline reviews.

SESSION: Session 2B: Social

Attentive Recurrent Social Recommendation

Collaborative filtering(CF) is one of the most popular techniques for building recommender systems. To alleviate the data sparsity issue in CF, social recommendation has emerged by leveraging social influence among users for better recommendation performance. In these systems, users' preferences over time are determined by their temporal dynamic interests as well as the general static interests. In the meantime, the complex interplay between users' internal interests and the social influence from the social network drives the evolution of users' preferences over time. Nevertheless, traditional approaches either neglected the social network structure for temporal recommendation or assumed a static social influence strength for static social recommendation. Thus, the problem of how to leverage social influence to enhance temporal social recommendation performance remains pretty much open. To this end, in this paper, we present an attentive recurrent network based approach for temporal social recommendation. In the proposed approach, we model users' complex dynamic and general static preferences over time by fusing social influence among users with two attention networks. Specifically, in the dynamic preference modeling process, we design a dynamic social aware recurrent neural network to capture users' complex latent interests over time, where a temporal attention network is proposed to learn the temporal social influence over time. In the general static preference modeling process, we characterize each user's static interest by introducing a static social attention network to model the stationary social influence among users. The output of the dynamic preferences and the static preferences are combined together in a unified end-to-end framework for the temporal social recommendation task. Finally, experimental results on two real-world datasets clearly show the superiority of our proposed model compared to the baselines.

Mention Recommendation for Multimodal Microblog with Cross-attention Memory Network

The users of Twitter-like social media normally use the "@'' sign to select a suitable person to mention. It is a significant role in promoting the user experience and information propagation. To help users easily find the usernames they want to mention, the mention recommendation task has received considerable attention in recent years. Previous methods only incorporated textual information when performing this task. However, many users not only post texts on social media but also the corresponding images. These images can provide additional information that is not included in the text, which could be helpful in improving the accuracy of a mention recommendation. To make full use of textual and visual information, we propose a novel cross-attention memory network to perform the mention recommendation task for multimodal tweets. We incorporate the interests of users with external memory and use the cross-attention mechanism to extract both textual and visual information. Experimental results on a dataset collected from Twitter demonstrated that the proposed method can achieve better performance than state-of-the-art methods that use textual information only.

Learning Geo-Social User Topical Profiles with Bayesian Hierarchical User Factorization

Understanding user interests and expertise is a vital component toward creating rich user models for information personalization in social media, recommender systems and web search. To capture the pair-wise interactions between geo-location and user's topical profile in social-spatial systems, we propose the modeling of fine-grained and multi-dimensional user geo-topic profiles. We then propose a two-layered Bayesian hierarchical user factorization generative framework to overcome user heterogeneity and another enhanced model integrated with user's contextual information to alleviate multi-dimensional sparsity. Through extensive experiments, we find the proposed model leads to a 5\textasciitilde13% improvement in precision and recall over the alternative baselines and an additional 6\textasciitilde11% improvement with the integration of user's contexts.

SESSION: Session 2C: App Search & Recommendation

Target Apps Selection: Towards a Unified Search Framework for Mobile Devices

With the recent growth of conversational systems and intelligent assistants such as Apple Siri and Google Assistant, mobile devices are becoming even more pervasive in our lives. As a consequence, users are getting engaged with the mobile apps and frequently search for an information need in their apps. However, users cannot search within their apps through their intelligent assistants. This requires a unified mobile search framework that identifies the target app(s) for the user's query, submits the query to the app(s), and presents the results to the user. In this paper, we take the first step forward towards developing unified mobile search. In more detail, we introduce and study the task of target apps selection, which has various potential real-world applications. To this aim, we analyze attributes of search queries as well as user behaviors, while searching with different mobile apps. The analyses are done based on thousands of queries that we collected through crowdsourcing. We finally study the performance of state-of-the-art retrieval models for this task and propose two simple yet effective neural models that significantly outperform the baselines. Our neural approaches are based on learning high-dimensional representations for mobile apps. Our analyses and experiments suggest specific future directions in this research area.

SESSION: Session 2D: Conversational Systems

Dialogue Act Recognition via CRF-Attentive Structured Network

Dialogue Act Recognition (DAR) is a challenging problem in dialogue interpretation, which aims to associate semantic labels to utterances and characterize the speaker's intention. Currently, many existing approaches formulate the DAR problem ranging from multi-classification to structured prediction, which suffer from handcrafted feature extensions and attentive contextual dependencies. In this paper, we tackle the problem of DAR from the viewpoint of extending richer Conditional Random Field (CRF) structured dependencies without abandoning end-to-end training. We incorporate hierarchical semantic inference with memory mechanism on the utterance modeling at multiple levels. We then utilize the structured attention network on the linear-chain CRF to dynamically separate the utterances into cliques. The extensive experiments on two primary benchmark datasets Switchboard Dialogue Act (SWDA) and Meeting Recorder Dialogue Act (MRDA) datasets show that our method achieves better performance than other state-of-the-art solutions to the problem.

Conversational Recommender System

A personalized conversational sales agent could have much commercial potential. E-commerce companies such as Amazon, eBay, JD, Alibaba etc. are piloting such kind of agents with their users. However, the research on this topic is very limited and existing solutions are either based on single round adhoc search engine or traditional multi round dialog system. They usually only utilize user inputs in the current session, ignoring users' long term preferences. On the other hand, it is well known that sales conversion rate can be greatly improved based on recommender systems, which learn user preferences based on past purchasing behavior and optimize business oriented metrics such as conversion rate or expected revenue. In this work, we propose to integrate research in dialog systems and recommender systems into a novel and unified deep reinforcement learning framework to build a personalized conversational recommendation agent that optimizes a per session based utility function. In particular, we propose to represent a user conversation history as a semi-structured user query with facet-value pairs. This query is generated and updated by belief tracker that analyzes natural language utterances of user at each step. We propose a set of machine actions tailored for recommendation agents and train a deep policy network to decide which action (i.e. asking for the value of a facet or making a recommendation) the agent should take at each step. We train a personalized recommendation model that uses both the user's past ratings and user query collected in the current conversational session when making rating predictions and generating recommendations. Such a conversational system often tries to collect user preferences by asking questions. Once enough user preference is collected, it makes personalized recommendations to the user. We perform both simulation experiments and real online user studies to demonstrate the effectiveness of the proposed framework.

Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems

Intelligent personal assistant systems with either text-based or voice-based conversational interfaces are becoming increasingly popular around the world. Retrieval-based conversation models have the advantages of returning fluent and informative responses. Most existing studies in this area are on open domain ''chit-chat'' conversations or task / transaction oriented conversations. More research is needed for information-seeking conversations. There is also a lack of modeling external knowledge beyond the dialog utterances among current conversational models. In this paper, we propose a learning framework on the top of deep neural matching networks that leverages external knowledge for response ranking in information-seeking conversation systems. We incorporate external knowledge into deep neural models with pseudo-relevance feedback and QA correspondence knowledge distillation. Extensive experiments with three information-seeking conversation data sets including both open benchmarks and commercial data show that, our methods outperform various baseline methods including several deep text matching models and the state-of-the-art method on response selection in multi-turn conversations. We also perform analysis over different response types, model variations and ranking examples. Our models and research findings provide new insights on how to utilize external knowledge with deep neural models for response selection and have implications for the design of the next generation of information-seeking conversation systems.

Chat More: Deepening and Widening the Chatting Topic via A Deep Model

The past decade has witnessed the boom of human-machine interactions, particularly via dialog systems. In this paper, we study the task of response generation in open-domain multi-turn dialog systems. Many research efforts have been dedicated to building intelligent dialog systems, yet few shed light on deepening or widening the chatting topics in a conversational session, which would attract users to talk more. To this end, this paper presents a novel deep scheme consisting of three channels, namely global, wide, and deep ones. The global channel encodes the complete historical information within the given context, the wide one employs an attention-based recurrent neural network model to predict the keywords that may not appear in the historical context, and the deep one trains a Multi-layer Perceptron model to select some keywords for an in-depth discussion. Thereafter, our scheme integrates the outputs of these three channels to generate desired responses. To justify our model, we conducted extensive experiments to compare our model with several state-of-the-art baselines on two datasets: one is constructed by ourselves and the other is a public benchmark dataset. Experimental results demonstrate that our model yields promising performance by widening or deepening the topics of interest.

SESSION: Session 3A: Social Good

Identifying Sub-events and Summarizing Disaster-Related Information from Microblogs

In recent times, humanitarian organizations increasingly rely on social media to search for information useful for disaster response. These organizations have varying information needs ranging from general situational awareness (i.e., to understand a bigger picture) to focused information needs e.g., about infrastructure damage, urgent needs of affected people. This research proposes a novel approach to help crisis responders fulfill their information needs at different levels of granularities. Specifically, the proposed approach presents simple algorithms to identify sub-events and generate summaries of big volume of messages around those events using an Integer Linear Programming (ILP) technique. Extensive evaluation on a large set of real world Twitter dataset shows (a). our algorithm can identify important sub-events with high recall (b). the summarization scheme shows (6---30%) higher accuracy of our system compared to many other state-of-the-art techniques. The simplicity of the algorithms ensures that the entire task is done in real time which is needed for practical deployment of the system.

The Rise of Guardians: Fact-checking URL Recommendation to Combat Fake News

A large body of research work and efforts have been focused on detecting fake news and building online fact-check systems in order to debunk fake news as soon as possible. Despite the existence of these systems, fake news is still wildly shared by online users. It indicates that these systems may not be fully utilized. After detecting fake news, what is the next step to stop people from sharing it? How can we improve the utilization of these fact-check systems? To fill this gap, in this paper, we (i) collect and analyze online users called guardians, who correct misinformation and fake news in online discussions by referring fact-checking URLs; and (ii) propose a novel fact-checking URL recommendation model to encourage the guardians to engage more in fact-checking activities. We found that the guardians usually took less than one day to reply to claims in online conversations and took another day to spread verified information to hundreds of millions of followers. Our proposed recommendation model outperformed four state-of-the-art models by 11%~33%. Our source code and dataset are available at http://web.cs.wpi.edu/~kmlee/data/gau.html.

SESSION: Session 3B: Privacy

Intent-aware Query Obfuscation for Privacy Protection in Personalized Web Search

Modern web search engines exploit users' search history to personalize search results, with a goal of improving their service utility on a per-user basis. But it is this very dimension that leads to the risk of privacy infringement and raises serious public concerns. In this work, we propose a client-centered intent-aware query obfuscation solution for protecting user privacy in a personalized web search scenario. In our solution, each user query is submitted with l additional cover queries and corresponding clicks, which act as decoys to mask users' genuine search intent from a search engine. The cover queries are sequentially sampled from a set of hierarchically organized language models to ensure the coherency of fake search intents in a cover search task. Our approach emphasizes the plausibility of generated cover queries, not only to the current genuine query but also to previous queries in the same task, to increase the complexity for a search engine to identify a user's true intent. We also develop two new metrics from an information theoretic perspective to evaluate the effectiveness of provided privacy protection. Comprehensive experiment comparisons with state-of-the-art query obfuscation techniques are performed on the public AOL search log, and the propitious results substantiate the effectiveness of our solution.

A Personal Privacy Preserving Framework: I Let You Know Who Can See What

The booming of social networks has given rise to a large volume of user-generated contents (UGCs), most of which are free and publicly available. A lot of users' personal aspects can be extracted from these UGCs to facilitate personalized applications as validated by many previous studies. Despite their value, UGCs can place users at high privacy risks, which thus far remains largely untapped. Privacy is defined as the individual's ability to control what information is disclosed, to whom, when and under what circumstances. As people and information both play significant roles, privacy has been elaborated as a boundary regulation process, where individuals regulate interaction with others by altering the openness degree of themselves to others. In this paper, we aim to reduce users' privacy risks on social networks by answering the question of Who Can See What. Towards this goal, we present a novel scheme, comprising of descriptive, predictive and prescriptive components. In particular, we first collect a set of posts and extract a group of privacy-oriented features to describe the posts. We then propose a novel taxonomy-guided multi-task learning model to predict which personal aspects are uncovered by the posts. Lastly, we construct standard guidelines by the user study with 400 users to regularize users' actions for preventing their privacy leakage. Extensive experiments on a real-world dataset well verified our scheme.

SynTF: Synthetic and Differentially Private Term Frequency Vectors for Privacy-Preserving Text Mining

Text mining and information retrieval techniques have been developed to assist us with analyzing, organizing and retrieving documents with the help of computers. In many cases, it is desirable that the authors of such documents remain anonymous: Search logs can reveal sensitive details about a user, critical articles or messages about a company or government might have severe or fatal consequences for a critic, and negative feedback in customer surveys might negatively impact business relations if they are identified. Simply removing personally identifying information from a document is, however, insufficient to protect the writer's identity: Given some reference texts of suspect authors, so-called authorship attribution methods can reidentfy the author from the text itself. One of the most prominent models to represent documents in many common text mining and information retrieval tasks is the vector space model where each document is represented as a vector, typically containing its term frequencies or related quantities. We therefore propose an automated text anonymization approach that produces synthetic term frequency vectors for the input documents that can be used in lieu of the original vectors. We evaluate our method on an exemplary text classification task and demonstrate that it only has a low impact on its accuracy. In contrast, we show that our method strongly affects authorship attribution techniques to the level that they become infeasible with a much stronger decline in accuracy. Other than previous authorship obfuscation methods, our approach is the first that fulfills differential privacy and hence comes with a provable plausible deniability guarantee.

Privacy-aware Ranking with Tree Ensembles on the Cloud

Tree-based ensembles are widely used for document ranking but supporting such a method efficiently under a privacy-preserving constraint on the cloud is an open research problem. The main challenge is that letting the cloud server perform ranking computation may unsafely reveal privacy-sensitive information. To address privacy with tree-based server-side ranking, this paper proposes to reduce the learning-to-rank model dependence on composite features as a trade-off, and develops comparison-preserving mapping to hide feature values and tree thresholds. To justify the above approach, the presented analysis shows that a decision tree with simplifiable composite features can be transformed into another tree using raw features without increasing the training accuracy loss. This paper analyzes the privacy properties of the proposed scheme, and compares the relevance of gradient boosting regression trees, LambdaMART, and random forests using raw features for several test data sets under the privacy consideration, and assesses the competitiveness of a hybrid model based on these algorithms.

SESSION: Session 3C: Question Answering

Multihop Attention Networks for Question Answer Matching

Attention based neural network models have been successfully applied in answer selection, which is an important subtask of question answering (QA). These models often represent a question by a single vector and find its corresponding matches by attending to candidate answers. However, questions and answers might be related to each other in complicated ways which cannot be captured by single-vector representations. In this paper, we propose Multihop Attention Networks (MAN) which aim to uncover these complex relations for ranking question and answer pairs. Unlike previous models, we do not collapse the question into a single vector, instead we use multiple vectors which focus on different parts of the question for its overall semantic representation and apply multiple steps of attention to learn representations for the candidate answers. For each attention step, in addition to common attention mechanisms, we adopt sequential attention which utilizes context information for computing context-aware attention weights. Via extensive experiments, we show that MAN outperforms state-of-the-art approaches on popular benchmark QA datasets. Empirical studies confirm the effectiveness of sequential attention over other attention mechanisms.

Ranking Documents by Answer-Passage Quality

Evidence derived from passages that closely represent likely answers to a posed query can be useful input to the ranking process. Based on a novel use of Community Question Answering data, we present an approach for the creation of such passages. A general framework for extracting answer passages and estimating their quality is proposed, and this evidence is integrated into ranking models. Our experiments on two web collections show that such quality estimates from answer passages provide a strong indication of document relevance and compare favorably to previous passage-based methods. Combining such evidence can significantly improve over a set of state-of-the-art ranking models, including Quality-Biased Ranking, External Expansion, and a combination of both. A final ranking model that incorporates all quality estimates achieves further improvements on both collections.

Characterizing and Supporting Question Answering in Human-to-Human Communication

Email continues to be one of the most important means of online communication. People spend a significant amount of time sending, reading, searching and responding to email in order to manage tasks, exchange information, etc. In this paper, we focus on information exchange over enterprise email in the form of questions and answers. We study a large scale publicly available email dataset to characterize information exchange via questions and answers in enterprise email. We augment our analysis with a survey to gain insights on the types of questions exchanged, when and how do people get back to them and whether this behavior is adequately supported by existing email management and search functionality. We leverage this understanding to define the task of extracting question/answer pairs from threaded email conversations. We propose a neural network based approach that matches the question to the answer considering comparisons at different levels of granularity. We also show that we can improve the performance by leveraging external data of question and answer pairs. We test our approach using a manually labeled email data collected using a crowd-sourcing annotation study. Our findings have implications for designing email clients and intelligent agents that support question answering and information lookup in email.

SESSION: Session 3D: Learning to Rank II

Adversarial Personalized Ranking for Recommendation

Item recommendation is a personalized ranking task. To this end, many recommender systems optimize models with pairwise ranking objectives, such as the Bayesian Personalized Ranking (BPR). Using matrix Factorization (MF) - the most widely used model in recommendation - as a demonstration, we show that optimizing it with BPR leads to a recommender model that is not robust. In particular, we find that the resultant model is highly vulnerable to adversarial perturbations on its model parameters, which implies the possibly large error in generalization. To enhance the robustness of a recommender model and thus improve its generalization performance, we propose a new optimization framework, namely Adversarial Personalized Ranking (APR). In short, our APR enhances the pairwise ranking method BPR by performing adversarial training. It can be interpreted as playing a minimax game, where the minimization of the BPR objective function meanwhile defends an adversary, which adds adversarial perturbations on model parameters to maximize the BPR objective function. To illustrate how it works, we implement APR on MF by adding adversarial perturbations on the embedding vectors of users and items. Extensive experiments on three public real-world datasets demonstrate the effectiveness of APR - by optimizing MF with APR, it outperforms BPR with a relative improvement of 11.2% on average and achieves state-of-the-art performance for item recommendation. Our implementation is available at: \urlhttps://github.com/hexiangnan/adversarial_personalized_ranking.

Turning Clicks into Purchases: Revenue Optimization for Product Search in E-Commerce

In recent years, product search engines have emerged as a key factor for online businesses. According to a recent survey, over 55% of online customers begin their online shopping journey by searching on an E-Commerce (EC) website like Amazon as opposed to a generic web search engine like Google. Information retrieval research to date has been focused on optimizing search ranking algorithms for web documents while little attention has been paid to product search. There are several intrinsic differences between web search and product search that make the direct application of traditional search ranking algorithms to EC search platforms difficult. First, the success of web and product search is measured differently; one seeks to optimize for relevance while the other must optimize for both relevance and revenue. Second, when using real-world EC transaction data, there is no access to manually annotated labels. In this paper, we address these differences with a novel learning framework for EC product search called LETORIF (LEarning TO Rank with Implicit Feedback). In this framework, we utilize implicit user feedback signals (such as user clicks and purchases) and jointly model the different stages of the shopping journey to optimize for EC sales revenue. We conduct experiments on real-world EC transaction data and introduce a a new evaluation metric to estimate expected revenue after re-ranking. Experimental results show that LETORIF outperforms top competitors in improving purchase rates and total revenue earned.

Modeling Diverse Relevance Patterns in Ad-hoc Retrieval

Assessing relevance between a query and a document is challenging in ad-hoc retrieval due to its diverse patterns, i.e., a document could be relevant to a query as a whole or partially as long as it provides sufficient information for users' need. Such diverse relevance patterns require an ideal retrieval model to be able to assess relevance in the right granularity adaptively. Unfortunately, most existing retrieval models compute relevance at a single granularity, either document-wide or passage-level, or use fixed combination strategy, restricting their ability in capturing diverse relevance patterns. In this work, we propose a data-driven method to allow relevance signals at different granularities to compete with each other for final relevance assessment. Specifically, we propose a HIerarchical Neural maTching model (HiNT) which consists of two stacked components, namely local matching layer and global decision layer. The local matching layer focuses on producing a set of local relevance signals by modeling the semantic matching between a query and each passage of a document. The global decision layer accumulates local signals into different granularities and allows them to compete with each other to decide the final relevance score.Experimental results demonstrate that our HiNT model outperforms existing state-of-the-art retrieval models significantly on benchmark ad-hoc retrieval datasets.

SESSION: Session 4A: Fairness & Robustness

Unbiased Learning to Rank with Unbiased Propensity Estimation

Learning to rank with biased click data is a well-known challenge. A variety of methods has been explored to debias click data for learning to rank such as click models, result interleaving and, more recently, the unbiased learning-to-rank framework based on inverse propensity weighting. Despite their differences, most existing studies separate the estimation of click bias (namely the propensity model ) from the learning of ranking algorithms. To estimate click propensities, they either conduct online result randomization, which can negatively affect the user experience, or offline parameter estimation, which has special requirements for click data and is optimized for objectives (e.g. click likelihood) that are not directly related to the ranking performance of the system. In this work, we address those problems by unifying the learning of propensity models and ranking models. We find that the problem of estimating a propensity model from click data is a dual problem of unbiased learning to rank. Based on this observation, we propose a Dual Learning Algorithm (DLA) that jointly learns an unbiased ranker and an unbiased propensity model. DLA is an automatic unbiased learning-to-rank framework as it directly learns unbiased ranking models from biased click data without any preprocessing. It can adapt to the change of bias distributions and is applicable to online learning. Our empirical experiments with synthetic and real-world data show that the models trained with DLA significantly outperformed the unbiased learning-to-rank algorithms based on result randomization and the models trained with relevance signals extracted by click models.

Ranking Robustness Under Adversarial Document Manipulations

For many queries in the Web retrieval setting there is an on-going ranking competition: authors manipulate their documents so as to promote them in rankings. Such competitions can have unwarranted effects not only in terms of retrieval effectiveness, but also in terms of ranking robustness. A case in point, rankings can (rapidly) change due to small indiscernible perturbations of documents. While there has been a recent growing interest in analyzing the robustness of classifiers to adversarial manipulations, there has not yet been a study of the robustness of relevance-ranking functions. We address this challenge by formally analyzing different definitions and aspects of the robustness of learning-to-rank-based ranking functions. For example, we formally show that increased regularization of linear ranking functions increases ranking robustness. This finding leads us to conjecture that decreased variance of any ranking function results in increased robustness. We propose several measures for quantifying ranking robustness and use them to analyze ranking competitions between documents' authors. The empirical findings support our formal analysis and conjecture for both RankSVM and LambdaMART.

Equity of Attention: Amortizing Individual Fairness in Rankings

Rankings of people and items are at the heart of selection-making, match-making, and recommender systems, ranging from employment sites to sharing economy platforms. As ranking positions influence the amount of attention the ranked subjects receive, biases in rankings can lead to unfair distribution of opportunities and resources such as jobs or income. This paper proposes new measures and mechanisms to quantify and mitigate unfairness from a bias inherent to all rankings, namely, the position bias which leads to disproportionately less attention being paid to low-ranked subjects. Our approach differs from recent fair ranking approaches in two important ways. First, existing works measure unfairness at the level of subject groups while our measures capture unfairness at the level of individual subjects, and as such subsume group unfairness. Second, as no single ranking can achieve individual attention fairness, we propose a novel mechanism that achieves amortized fairness, where attention accumulated across a series of rankings is proportional to accumulated relevance. We formulate the challenge of achieving amortized individual fairness subject to constraints on ranking quality as an online optimization problem and show that it can be solved as an integer linear program. Our experimental evaluation reveals that unfair attention distribution in rankings can be substantial, and demonstrates that our method can improve individual fairness while retaining high ranking quality.

Should I Follow the Crowd?: A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems

The use of IR methodology in the evaluation of recommender systems has become common practice in recent years. IR metrics have been found however to be strongly biased towards rewarding algorithms that recommend popular items "the same bias that state of the art recommendation algorithms display. Recent research has confirmed and measured such biases, and proposed methods to avoid them. The fundamental question remains open though whether popularity is really a bias we should avoid or not; whether it could be a useful and reliable signal in recommendation, or it may be unfairly rewarded by the experimental biases. We address this question at a formal level by identifying and modeling the conditions that can determine the answer, in terms of dependencies between key random variables, involving item rating, discovery and relevance. We find conditions that guarantee popularity to be effective or quite the opposite, and for the measured metric values to reflect a true effectiveness, or qualitatively deviate from it. We exemplify and confirm the theoretical findings with empirical results. We build a crowdsourced dataset devoid of the usual biases displayed by common publicly available data, in which we illustrate contradictions between the accuracy that would be measured in a common biased offline experimental setting, and the actual accuracy that can be measured with unbiased observations.

SESSION: Session 4B: Behavior

Constructing an Interaction Behavior Model for Web Image Search

User interaction behavior is a valuable source of implicit relevance feedback. In Web image search a different type of search result presentation is used than in general Web search, which leads to different interaction mechanisms and user behavior. For example, image search results are self-contained, so that users do not need to click the results to view the landing page as in general Web search, which generates sparse click data. Also, two-dimensional result placement instead of a linear result list makes browsing behaviors more complex. Thus, it is hard to apply standard user behavior models (e.g., click models) developed for general Web search to Web image search. In this paper, we conduct a comprehensive image search user behavior analysis using data from a lab-based user study as well as data from a commercial search log. We then propose a novel interaction behavior model, called grid-based user browsing model (GUBM), whose design is motivated by observations from our data analysis. GUBM can both capture users' interaction behavior, including cursor hovering, and alleviate position bias. The advantages of GUBM are two-fold: (1) It is based on an unsupervised learning method and does not need manually annotated data for training. (2) It is based on user interaction features on search engine result pages (SERPs) and is easily transferable to other scenarios that have a grid-based interface such as video search engines. We conduct extensive experiments to test the performance of our model using a large-scale commercial image search log. Experimental results show that in terms of behavior prediction (perplexity), and topical relevance and image quality (normalized discounted cumulative gain (NDCG)), GUBM outperforms state-of-the-art baseline models as well as the original ranking. We make the implementation of GUBM and related datasets publicly available for future studies.

Between Clicks and Satisfaction: Study on Multi-Phase User Preferences and Satisfaction for Online News Reading

Click signal has been widely used for designing and evaluating interactive information systems, which is taken as the indicator of user preference. However, click signal does not capture post-click user experience. Very commonly, the user first clicked an item and then found it is not what he wanted after reading its content, which shows there is a gap between user click and user actual preference. Previous studies on web search have incorporated other user behaviors, such as dwell time, to reduce the gap. Unfortunately, for other scenarios such as recommendation and online news reading, there still lacks a thorough understanding of the relationship between click and user preference, and the corresponding reasons which are the focus of this work. Based on an in-depth laboratory user study of online news reading scenario in the mobile environment, we show that click signal does not align with user preference. Besides, we find that user preference changes frequently, hence preferences in three phases are proposed: Before-Read Preference, After-Read Preference and Post-task Preference. In addition, the statistic analysis shows that the changes are highly related to news quality and the context of user interactions. Meanwhile, many other user behaviors, like viewport time, dwell time, and read speed, are found reflecting user preference in different phases. Furthermore, with the help of various kinds of user behaviors, news quality, and interaction context, we build an effective model to predict whether the user actually likes the clicked news. Finally, we replace binary click signals of traditional click-based evaluation metrics, like Click-Through Rate, with the predicted item-level preference, and significant improvements are achieved in estimating the user's list-level satisfaction. Our work sheds light on the understanding of user click behaviors and provides a method for better estimating user interest and satisfaction. The proposed model could also be helpful to various recommendation tasks in mobile scenarios.

The Effects of Manipulating Task Determinability on Search Behaviors and Outcomes

An important area of IR research involves understanding how task characteristics influence search behaviors and outcomes. Task complexity is one characteristic that has received considerable attention. One view of task complexity is through the lens of a priori determinability -- the level of uncertainty about task outcomes and processes experienced by the searcher. In this work, we manipulated the determinability of comparative tasks. Our task manipulation involved modifying the scope of the task by specifying exact items and/or exact (objective or subjective) dimensions to consider as part of the task. This paper reports on a within-subject study (N=144) where we investigated how our task manipulation influenced participants' perceptions, levels of engagement, search effort, and choice of search strategies. Our results suggest a complex relationship between task scope, determinability, and different outcome measures. Our most open-ended tasks were perceived to have low determinability (high uncertainty), but were the least challenging for participants due to satisficing. Furthermore, narrowing the scope of tasks by specifying items had a different effect than by specifying dimensions. Specifying items increased the task determinability (lower uncertainty) and made the task easier, while specifying dimensions did not increase the task determinability and made the task more challenging. A qualitative analysis of participants' queries suggests that searching for dimensions is more challenging than for items. Finally, we observed subtle differences between objective and subjective dimensions. We discuss implications for the design of IIR studies and tools to support users.

SESSION: Session 4C: Medical & Legal IR

Seed-driven Document Ranking for Systematic Reviews in Evidence-Based Medicine

Systematic review (SR) in evidence-based medicine is a literature review which provides a conclusion to a specific clinical question. To assure credible and reproducible conclusions, SRs are conducted by well-defined steps. One of the key steps, the screening step, is to identify relevant documents from a pool of candidate documents. Typically about 2000 candidate documents will be retrieved from databases using keyword queries for a SR. From which, about 20 relevant documents are manually identified by SR experts, based on detailed relevance conditions or eligibility criteria. Recent studies show that document ranking, or screening prioritization, is a promising way to improve the manual screening process. In this paper, we propose a seed-driven document ranking (SDR) model for effective screening, with the assumption that one relevant document is known, i.e., the seed document. Based on a detailed analysis of characteristics of relevant documents, SDR represents documents using bag of clinical terms, rather than the commonly used bag of words. More importantly, we propose a method to estimate the importance of the clinical terms based on their distribution in candidate documents. On benchmark dataset released by CLEF'17 eHealth Task 2, we show that the proposed SDR outperforms state-of-the-art solutions. Interestingly, we also observe that ranking based on word embedding representation of documents well complements SDR. The best ranking is achieved by combining the relevances estimated by SDR and by word embedding. Additionally, we report results of simulating the manual screening process with SDR.

A Dataset and an Examination of Identifying Passages for Due Diligence

We present and formalize the due diligence problem, where lawyers extract data from legal documents to assess risk in a potential merger or acquisition, as an information retrieval task. Furthermore, we describe the creation and annotation of a document collection for the due diligence problem that will foster research in this area. This dataset comprises 50 topics over 4,412 documents and ~15 million sentences and is a subset of our own internal training data. Using this dataset, we present what we have found to be the state of the art for information extraction in the due diligence problem. In particular, we find that when treating documents as sequences of labelled and unlabelled sentences, Conditional Random Fields significantly and substantially outperform other techniques for sequence-based (Hidden Markov Models) and non-sequence based machine learning (logistic regression). Included in this is an analysis of what we perceive to be the major failure cases when extraction is performed based upon sentence labels.

Generating Better Queries for Systematic Reviews

Systematic reviews form the cornerstone of evidence based medicine, aiming to answer complex medical questions based on all evidence currently available. Key to the effectiveness of a systematic review is an (often large) Boolean query used to search large publication repositories. These Boolean queries are carefully crafted by researchers and information specialists, and often reviewed by a panel of experts. However, little is known about the effectiveness of the Boolean queries at the time of formulation. In this paper we investigate whether a better Boolean query than that defined in the protocol of a systematic review, can be created, and we develop methods for the transformation of a given Boolean query into a more effective one. Our approach involves defining possible transformations of Boolean queries and their clauses. It also involves casting the problem of identifying a transformed query that is better than the original into: (i) a classification problem; and (ii) a learning to rank problem. Empirical experiments are conducted on a real set of systematic reviews. Analysis of results shows that query transformations that are better than the original queries do exist, and that our approaches are able to select more effective queries from the set of possible transformed queries so as to maximise different target effectiveness measures.

Modeling Dynamic Pairwise Attention for Crime Classification over Legal Articles

In juridical field, judges usually need to consult several relevant cases to determine the specific articles that the evidence violated, which is a task that is time consuming and needs extensive professional knowledge. In this paper, we focus on how to save the manual efforts and make the conviction process more efficient. Specifically, we treat the evidences as documents, and articles as labels, thus the conviction process can be cast as a multi-label classification problem. However, the challenge in this specific scenario lies in two aspects. One is that the number of articles that evidences violated is dynamic, which we denote as the label dynamic problem. The other is that most articles are violated by only a few of the evidences, which we denote as the label imbalance problem. Previous methods usually learn the multi-label classification model and the label thresholds independently, and may ignore the label imbalance problem. To tackle with both challenges, we propose a unified D ynamic P airwise A ttention M odel (DPAM for short) in this paper. Specifically, DPAM adopts the multi-task learning paradigm to learn the multi-label classifier and the threshold predictor jointly, and thus DPAM can improve the generalization performance by leveraging the information learned in both of the two tasks. In addition, a pairwise attention model based on article definitions is incorporated into the classification model to help alleviate the label imbalance problem. Experimental results on two real-world datasets show that our proposed approach significantly outperforms state-of-the-art multi-label classification methods.

SESSION: Session 4D: Recommender Systems - Methods

Learning Contextual Bandits in a Non-stationary Environment

Multi-armed bandit algorithms have become a reference solution for handling the explore/exploit dilemma in recommender systems, and many other important real-world problems, such as display advertisement. However, such algorithms usually assume a stationary reward distribution, which hardly holds in practice as users' preferences are dynamic. This inevitably costs a recommender system consistent suboptimal performance. In this paper, we consider the situation where the underlying distribution of reward remains unchanged over (possibly short) epochs and shifts at unknown time instants. In accordance, we propose a contextual bandit algorithm that detects possible changes of environment based on its reward estimation confidence and updates its arm selection strategy respectively. Rigorous upper regret bound analysis of the proposed algorithm demonstrates its learning effectiveness in such a non-trivial environment. Extensive empirical evaluations on both synthetic and real-world datasets for recommendation confirm its practical utility in a changing environment.

Improving Sequential Recommendation with Knowledge-Enhanced Memory Networks

With the revival of neural networks, many studies try to adapt powerful sequential neural models, ıe Recurrent Neural Networks (RNN), to sequential recommendation. RNN-based networks encode historical interaction records into a hidden state vector. Although the state vector is able to encode sequential dependency, it still has limited representation power in capturing complicated user preference. It is difficult to capture fine-grained user preference from the interaction sequence. Furthermore, the latent vector representation is usually hard to understand and explain. To address these issues, in this paper, we propose a novel knowledge enhanced sequential recommender. Our model integrates the RNN-based networks with Key-Value Memory Network (KV-MN). We further incorporate knowledge base (KB) information to enhance the semantic representation of KV-MN. RNN-based models are good at capturing sequential user preference, while knowledge-enhanced KV-MNs are good at capturing attribute-level user preference. By using a hybrid of RNNs and KV-MNs, it is expected to be endowed with both benefits from these two components. The sequential preference representation together with the attribute-level preference representation are combined as the final representation of user preference. With the incorporation of KB information, our model is also highly interpretable. To our knowledge, it is the first time that sequential recommender is integrated with external memories by leveraging large-scale KB information.

Collaborative Memory Network for Recommendation Systems

Recommendation systems play a vital role to keep users engaged with personalized content in modern online platforms. Deep learning has revolutionized many research fields and there is a recent surge of interest in applying it to collaborative filtering (CF). However, existing methods compose deep learning architectures with the latent factor model ignoring a major class of CF models, neighborhood or memory-based approaches. We propose Collaborative Memory Networks (CMN), a deep architecture to unify the two classes of CF models capitalizing on the strengths of the global structure of latent factor model and local neighborhood-based structure in a nonlinear fashion. Motivated by the success of Memory Networks, we fuse a memory component and neural attention mechanism as the neighborhood component. The associative addressing scheme with the user and item memories in the memory module encodes complex user-item relations coupled with the neural attention mechanism to learn a user-item specific neighborhood. Finally, the output module jointly exploits the neighborhood with the user and item memories to produce the ranking score. Stacking multiple memory modules together yield deeper architectures capturing increasingly complex user-item relations. Furthermore, we show strong connections between CMN components, memory networks and the three classes of CF models. Comprehensive experimental results demonstrate the effectiveness of CMN on three public datasets outperforming competitive baselines. Qualitative visualization of the attention weights provide insight into the model's recommendation process and suggest the presence of higher order interactions.

Streaming Ranking Based Recommender Systems

Studying recommender systems under streaming scenarios has become increasingly important because real-world applications produce data continuously and rapidly. However, most existing recommender systems today are designed in the context of an offline setting. Compared with the traditional recommender systems, large-volume and high-velocity are posing severe challenges for streaming recommender systems. In this paper, we investigate the problem of streaming recommendations being subject to higher input rates than they can immediately process with their available system resources (i.e., CPU and memory). In particular, we provide a principled framework called as SPMF (Stream-centered Probabilistic Matrix Factorization model), based on BPR (Bayesian Personalized Ranking) optimization framework, for performing efficient ranking based recommendations in stream settings. Experiments on three real-world datasets illustrate the superiority of SPMF in online recommendations.

SESSION: Session 5A: Location & Trajectory

Torch: A Search Engine for Trajectory Data

This paper presents a new trajectory search engine called Torch for querying road network trajectory data. Torch is able to efficiently process two types of typical queries (similarity search and Boolean search), and support a wide variety of trajectory similarity functions. Additionally, we propose a new similarity function LORS in Torch to measure the similarity in a more effective and efficient manner. Indexing and search in Torch works as follows. First, each raw vehicle trajectory is transformed to a set of road segments (edges) and a set of crossings (vertices) on the road network. Then a lightweight edge and vertex index called LEVI is built. Given a query, a filtering framework over LEVI is used to dynamically prune the trajectory search space based on the similarity measure imposed. Finally, the result set (ranked or Boolean) is returned. Extensive experiments on real trajectory datasets verify the effectiveness and efficiency of Torch.

Top-k Route Search through Submodularity Modeling of Recurrent POI Features

We consider a practical top-k route search problem: given a collection of points of interest (POIs) with rated features and traveling costs between POIs, a user wants to find k routes from a source to a destination and limited in a cost budget, that maximally match her needs on feature preferences. One challenge is dealing with the personalized diversity requirement where users have various trade-off between quantity (the number of POIs with a specified feature) and variety (the coverage of specified features). Another challenge is the large scale of the POI map and the great many alternative routes to search. We model the personalized diversity requirement by the whole class of submodular functions, and present an optimal solution to the top-k route search problem through indices for retrieving relevant POIs in both feature and route spaces and various strategies for pruning the search space using user preferences and constraints. We also present promising heuristic solutions and evaluate all the solutions on real life data.

A Contextual Attention Recurrent Architecture for Context-Aware Venue Recommendation

Venue recommendation systems aim to effectively rank a list of interesting venues users should visit based on their historical feedback (e.g. checkins). Such systems are increasingly deployed by Location-based Social Networks (LBSNs) such as Foursquare and Yelp to enhance their usefulness to users. Recently, various RNN architectures have been proposed to incorporate contextual information associated with the users' sequence of checkins (e.g. time of the day, location of venues) to effectively capture the users' dynamic preferences. However, these architectures assume that different types of contexts have an identical impact on the users' preferences, which may not hold in practice. For example, an ordinary context such as the time of the day reflects the user's current contextual preferences, whereas a transition context - such as a time interval from their last visited venue - indicates a transition effect from past behaviour to future behaviour. To address these challenges, we propose a novel Contextual Attention Recurrent Architecture (CARA) that leverages both sequences of feedback and contextual information associated with the sequences to capture the users' dynamic preferences. Our proposed recurrent architecture consists of two types of gating mechanisms, namely 1) a contextual attention gate that controls the influence of the ordinary context on the users' contextual preferences and 2) a time- and geo-based gate that controls the influence of the hidden state from the previous checkin based on the transition context. Thorough experiments on three large checkin and rating datasets from commercial LBSNs demonstrate the effectiveness of our proposed CARA architecture by significantly outperforming many state-of-the-art RNN architectures and factorisation approaches.

SESSION: Session 5B: Entities

Entity Set Search of Scientific Literature: An Unsupervised Ranking Approach

Literature search is critical for any scientific research. Different from Web or general domain search, a large portion of queries in scientific literature search are entity-set queries, that is, multiple entities of possibly different types. Entity-set queries reflect user's need for finding documents that contain multiple entities and reveal inter-entity relationships and thus pose non-trivial challenges to existing search algorithms that model each entity separately. However, entity-set queries are usually sparse (i.e., not so repetitive), which makes ineffective many supervised ranking models that rely heavily on associated click history. To address these challenges, we introduce SetRank, an unsupervised ranking framework that models inter-entity relationships and captures entity type information. Furthermore, we develop a novel unsupervised model selection algorithm, based on the technique of weighted rank aggregation, to automatically choose the parameter settings in SetRank without resorting to a labeled validation set. We evaluate our proposed unsupervised approach using datasets from TREC Genomics Tracks and Semantic Scholar's query log. The experiments demonstrate that SetRank significantly outperforms the baseline unsupervised models, especially on entity-set queries, and our model selection algorithm effectively chooses suitable parameter settings.

Towards Better Text Understanding and Retrieval through Kernel Entity Salience Modeling

This paper presents a Kernel Entity Salience Model (KESM) that improves text understanding and retrieval by better estimating entity salience (importance) in documents. KESM represents entities by knowledge enriched distributed representations, models the interactions between entities and words by kernels, and combines the kernel scores to estimate entity salience. The whole model is learned end-to-end using entity salience labels. The salience model also improves ad hoc search accuracy, providing effective ranking features by modeling the salience of query entities in candidate documents. Our experiments on two entity salience corpora and two TREC ad hoc search datasets demonstrate the effectiveness of KESM over frequency-based and feature-based methods. We also provide examples showing how KESM conveys its text understanding ability learned from entity salience to search.

Automated Comparative Table Generation for Facilitating Human Intervention in Multi-Entity Resolution

Entity resolution (ER), the process of identifying entities that refer to the same real-world object, has long been studied in the knowledge graph (KG) community, among many others. Humans, as a valuable source of background knowledge, are increasingly getting involved in this loop by crowdsourcing and active learning, where presenting condensed and easily-compared information is vital to help human intervene in an ER task. However, current methods for single entity or pairwise summarization cannot well support humans to observe and compare multiple entities simultaneously, which impairs the efficiency and accuracy of human intervention. In this paper, we propose an automated approach to select a few important properties and values for a set of entities, and assemble them by a comparative table. We formulate several optimization problems for generating an optimal comparative table according to intuitive goodness measures and various constraints. Our experiments on real-world datasets, comparison with related work and user study demonstrate the superior efficiency, precision and user satisfaction of our approach in multi-entity resolution (MER).

On-the-fly Table Generation

Many information needs revolve around entities, which would be better answered by summarizing results in a tabular format, rather than presenting them as a ranked list. Unlike previous work, which is limited to retrieving existing tables, we aim to answer queries by automatically compiling a table in response to a query. We introduce and address the task of on-the-fly table generation: given a query, generate a relational table that contains relevant entities (as rows) along with their key properties (as columns). This problem is decomposed into three specific subtasks: (i) core column entity ranking, (ii) schema determination, and (iii) value lookup. We employ a feature-based approach for entity ranking and schema determination, combining deep semantic features with task-specific signals. We further show that these two subtasks are not independent of each other and can assist each other in an iterative manner. For value lookup, we combine information from existing tables and a knowledge base. Using two sets of entity-oriented queries, we evaluate our approach both on the component level and on the end-to-end table generation task.

SESSION: Session 5C: New Metrics

Measuring the Utility of Search Engine Result Pages: An Information Foraging Based Measure

Web Search Engine Result Pages (SERPs) are complex responses to queries, containing many heterogeneous result elements (web results, advertisements, and specialised "answers'') positioned in a variety of layouts. This poses numerous challenges when trying to measure the quality of a SERP because standard measures were designed for homogeneous ranked lists. In this paper, we aim to measure the utility and cost of SERPs. To ground this work we adopt the \CWL framework which enables a direct comparison between different measures in the same units of measurement, i.e. expected (total) utility and cost. Within this framework, we propose a new measure based on information foraging theory, which can account for the heterogeneity of elements, through different costs, and which naturally motivates the development of a user stopping model that adapts behaviour depending on the rate of gain. This directly connects models of how people search with how we measure search, providing a number of new dimensions in which to investigate and evaluate user behaviour and performance. We perform an analysis over~1000 popular queries issued to a major search engine, and report the aggregate utility experienced by users over time. Then in an comparison against common measures, we show that the proposed foraging based measure provides a more accurate reflection of the utility and of observed behaviours (stopping rank and time spent).

How Well do Offline and Online Evaluation Metrics Measure User Satisfaction in Web Image Search?

Comparing to general Web search engines, image search engines present search results differently, with two-dimensional visual image panel for users to scroll and browse quickly. These differences in result presentation can significantly impact the way that users interact with search engines, and therefore affect existing methods of search evaluation. Although different evaluation metrics have been thoroughly studied in the general Web search environment, how those offline and online metrics reflect user satisfaction in the context of image search is an open question. To shed light on this, we conduct a laboratory user study that collects both explicit user satisfaction feedbacks as well as user behavior signals such as clicks. Based on the combination of both externally assessed topical relevance and image quality judgments, offline image search metrics can be better correlated with user satisfaction than merely using topical relevance. We also demonstrate that existing offline Web search metrics can be adapted to evaluate on a two-dimensional presentation for image search. With respect to online metrics, we find that those based on image click information significantly outperform offline metrics. To our knowledge, our work is the first to thoroughly establish the relationship between different measures and user satisfaction in image search.

An Axiomatic Analysis of Diversity Evaluation Metrics: Introducing the Rank-Biased Utility Metric

Many evaluation metrics have been defined to evaluate the effectiveness ad-hoc retrieval and search result diversification systems. However, it is often unclear which evaluation metric should be used to analyze the performance of retrieval systems given a specific task. Axiomatic analysis is an informative mechanism to understand the fundamentals of metrics and their suitability for particular scenarios. In this paper, we define a constraint-based axiomatic framework to study the suitability of existing metrics in search result diversification scenarios. The analysis informed the definition of Rank-Biased Utility (RBU) -- an adaptation of the well-known Rank-Biased Precision metric -- that takes into account redundancy and the user effort associated to the inspection of documents in the ranking. Our experiments over standard diversity evaluation campaigns show that the proposed metric captures quality criteria reflected by different metrics, being suitable in the absence of knowledge about particular features of the scenario under study.

SESSION: Session 5D: Recommender Systems - Applications

Cross-language Citation Recommendation via Hierarchical Representation Learning on Heterogeneous Graph

While the volume of scholarly publications has increased at a frenetic pace, accessing and consuming the useful candidate papers, in very large digital libraries, is becoming an essential and challenging task for scholars. Unfortunately, because of language barrier, some scientists (especially the junior ones or graduate students who do not master other languages) cannot efficiently locate the publications hosted in a foreign language repository. In this study, we propose a novel solution, cross-language citation recommendation via Hierarchical Representation Learning on Heterogeneous Graph (HRLHG), to address this new problem. HRLHG can learn a representation function by mapping the publications, from multilingual repositories, to a low-dimensional joint embedding space from various kinds of vertexes and relations on a heterogeneous graph. By leveraging both global (task specific) plus local (task independent) information as well as a novel supervised hierarchical random walk algorithm, the proposed method can optimize the publication representations by maximizing the likelihood of locating the important cross-language neighborhoods on the graph. Experiment results show that the proposed method can not only outperform state-of-the-art baseline models, but also improve the interpretability of the representation model for cross-language citation recommendation task.

Attentive Group Recommendation

Due to the prevalence of group activities in people's daily life, recommending content to a group of users becomes an important task in many information systems. A fundamental problem in group recommendation is how to aggregate the preferences of group members to infer the decision of a group. Toward this end, we contribute a novel solution, namely AGREE (short for ''Attentive Group REcommEndation''), to address the preference aggregation problem by learning the aggregation strategy from data, which is based on the recent developments of attention network and neural collaborative filtering (NCF). Specifically, we adopt an attention mechanism to adapt the representation of a group, and learn the interaction between groups and items from data under the NCF framework. Moreover, since many group recommender systems also have abundant interactions of individual users on items, we further integrate the modeling of user-item interactions into our method. Through this way, we can reinforce the two tasks of recommending items for both groups and users. By experimenting on two real-world datasets, we demonstrate that our AGREE model not only improves the group recommendation performance but also enhances the recommendation for users, especially for cold-start users that have no historical interactions individually.

Calendar-Aware Proactive Email Recommendation

In this paper, we study how to leverage calendar information to help with email re-finding using a zero-query prototype, Calendar-Aware Proactive Email Recommender System (CAPERS). CAPERS proactively selects and displays potentially useful emails to users based on their upcoming calendar events with a particular focus on meeting preparation. We approach this problem domain through a survey, a task-based experiment, and a field experiment comparing multiple email recommenders in a large technology company. We first show that a large proportion of email access is related to meetings and then study the effects of four email recommenders on user perception and engagement taking into account four categories of factors: the amount of email content, email recency, calendar-email content match, and calendar-email people match. We demonstrate that these factors all positively predict the usefulness of emails to meeting preparation and that calendar-email content match is the most important. We study the effects of different machine learning models for predicting usefulness and find that an online-learned linear model doubles user engagement compared with the baselines, which suggests the benefit of continuous online learning.

Structuring Wikipedia Articles with Section Recommendations

Sections are the building blocks of Wikipedia articles. They enhance readability and can be used as a structured entry point for creating and expanding articles. Structuring a new or already existing Wikipedia article with sections is a hard task for humans, especially for newcomers or less experienced editors, as it requires significant knowledge about how a well-written article looks for each possible topic. Inspired by this need, the present paper defines the problem of section recommendation for Wikipedia articles and proposes several approaches for tackling it. Our systems can help editors by recommending what sections to add to already existing or newly created Wikipedia articles. Our basic paradigm is to generate recommendations by sourcing sections from articles that are similar to the input article. We explore several ways of defining similarity for this purpose (based on topic modeling, collaborative filtering, and Wikipedia's category system). We use both automatic and human evaluation approaches for assessing the performance of our recommendation system, concluding that the category-based approach works best, achieving [email protected] of about 80% in the human evaluation.

SESSION: Session 6A: Evaluation

On Fine-Grained Relevance Scales

In Information Retrieval evaluation, the classical approach of adopting binary relevance judgments has been replaced by multi-level relevance judgments and by gain-based metrics leveraging such multi-level judgment scales. Recent work has also proposed and evaluated unbounded relevance scales by means of Magnitude Estimation (ME) and compared them with multi-level scales. While ME brings advantages like the ability for assessors to always judge the next document as having higher or lower relevance than any of the documents they have judged so far, it also comes with some drawbacks. For example, it is not a natural approach for human assessors to judge items as they are used to do on the Web (e.g., 5-star rating). In this work, we propose and experimentally evaluate a bounded and fine-grained relevance scale having many of the advantages and dealing with some of the issues of ME. We collect relevance judgments over a 100-level relevance scale (S100) by means of a large-scale crowdsourcing experiment and compare the results with other relevance scales (binary, 4-level, and ME) showing the benefit of fine-grained scales over both coarse-grained and unbounded scales as well as highlighting some new results on ME. Our results show that S100 maintains the flexibility of unbounded scales like ME in providing assessors with ample choice when judging document relevance (i.e., assessors can fit relevance judgments in between of previously given judgments). It also allows assessors to judge on a more familiar scale (e.g., on 10 levels) and to perform efficiently since the very first judging task.

Automatic Ground Truth Expansion for Timeline Evaluation

The development of automatic systems that can produce timeline summaries by filtering high-volume streams of text documents, retaining only those that are relevant to a particular information need (e.g. topic or event), remains a very challenging task. To advance the field of automatic timeline generation, robust and reproducible evaluation methodologies are needed. To this end, several evaluation metrics and labeling methodologies have recently been developed - focusing on information nugget or cluster-based ground truth representations, respectively. These methodologies rely on human assessors manually mapping timeline items (e.g. tweets) to an explicit representation of what information a 'good' summary should contain. However, while these evaluation methodologies produce reusable ground truth labels, prior works have reported cases where such labels fail to accurately estimate the performance of new timeline generation systems due to label incompleteness. In this paper, we first quantify the extent to which timeline summary ground truth labels fail to generalize to new summarization systems, then we propose and evaluate new automatic solutions to this issue. In particular, using a depooling methodology over 21 systems and across three high-volume datasets, we quantify the degree of system ranking error caused by excluding those systems when labeling. We show that when considering lower-effectiveness systems, the test collections are robust (the likelihood of systems being miss-ranked is low). However, we show that the risk of systems being miss-ranked increases as the effectiveness of systems held-out from the pool increases. To reduce the risk of miss-ranking systems, we also propose two different automatic ground truth label expansion techniques. Our results show that our proposed expansion techniques can be effective for increasing the robustness of the TREC-TS test collections, markedly reducing the number of miss-rankings by up to 50% on average among the scenarios tested.

Stochastic Simulation of Test Collections: Evaluation Scores

Part of Information Retrieval evaluation research is limited by the fact that we do not know the distributions of system effectiveness over the populations of topics and, by extension, their true mean scores. The workaround usually consists in resampling topics from an existing collection and approximating the statistics of interest with the observations made between random subsamples, as if one represented the population and the other a random sample. However, this methodology is clearly limited by the availability of data, the impossibility to control the properties of these data, and the fact that we do not really measure what we intend to. To overcome these limitations, we propose a method based on vine copulas for stochastic simulation of evaluation results where the true system distributions are known upfront. In the basic use case, it takes the scores from an existing collection to build a semi-parametric model representing the set of systems and the population of topics, which can then be used to make realistic simulations of the scores by the same systems but on random new topics. Our ability to simulate this kind of data not only eliminates the current limitations, but also offers new opportunities for research. As an example, we show the benefits of this approach in two sample applications replicating typical experiments found in the literature. We provide a full R package to simulate new data following the proposed method, which can also be used to fully reproduce the results in this paper.

Offline Comparative Evaluation with Incremental, Minimally-Invasive Online Feedback

We investigate the use of logged user interaction data---queries and clicks---for offline evaluation of new search systems in the context of counterfactual analysis. The challenge of evaluating a new ranker against log data collected from a static production ranker is that new rankers may retrieve documents that have never been seen in the logs before, and thus lack any logged feedback from users. Additionally, the ranker itself could bias user actions such that even documents that have been seen in the logs would have exhibited different interaction patterns had they been retrieved and ranked by the new ranker. We present a methodology for incrementally logging interactions on previously-unseen documents for use in computation of an unbiased estimator of a new ranker's effectiveness. Our method is very lightly invasive with respect to the production ranker results to insure against users becoming dissatisfied if the new ranker is poor. We demonstrate how well our methods work in a simulation environment designed to be challenging for such methods to argue that they are likely to work in a wide variety of scenarios.

SESSION: Session 6B: Hashing & Embedding

BiNE: Bipartite Network Embedding

This work develops a representation learning method for bipartite networks. While existing works have developed various embedding methods for network data, they have primarily focused on homogeneous networks in general and overlooked the special properties of bipartite networks. As such, these methods can be suboptimal for embedding bipartite networks. In this paper, we propose a new method named BiNE, short for Bipartite Network Embedding, to learn the vertex representations for bipartite networks. By performing biased random walks purposefully, we generate vertex sequences that can well preserve the long-tail distribution of vertices in the original bipartite network. We then propose a novel optimization framework by accounting for both the explicit relations (i.e., observed links) and implicit relations (i.e., unobserved but transitive links) in learning the vertex representations. We conduct extensive experiments on several real datasets covering the tasks of link prediction (classification), recommendation (personalized ranking), and visualization. Both quantitative results and qualitative analysis verify the effectiveness and rationality of our BiNE method.

Deep Domain Adaptation Hashing with Adversarial Learning

The recent advances in deep neural networks have demonstrated high capability in a wide variety of scenarios. Nevertheless, fine-tuning deep models in a new domain still requires a significant amount of labeled data despite expensive labeling efforts. A valid question is how to leverage the source knowledge plus unlabeled or only sparsely labeled target data for learning a new model in target domain. The core problem is to bring the source and target distributions closer in the feature space. In the paper, we facilitate this issue in an adversarial learning framework, in which a domain discriminator is devised to handle domain shift. Particularly, we explore the learning in the context of hashing problem, which has been studied extensively due to its great efficiency in gigantic data. Specifically, a novel Deep Domain Adaptation Hashing with Adversarial learning (DeDAHA) architecture is presented, which mainly consists of three components: a deep convolutional neural networks (CNN) for learning basic image/frame representation followed by an adversary stream on one hand to optimize the domain discriminator, and on the other, to interact with each domain-specific hashing stream for encoding image representation to hash codes. The whole architecture is trained end-to-end by jointly optimizing two types of losses, i.e., triplet ranking loss to preserve the relative similarity ordering in the input triplets and adversarial loss to maximally fool the domain discriminator with the learnt source and target feature distributions. Extensive experiments are conducted on three domain transfer tasks, including cross-domain digits retrieval, image to image and image to video transfers, on several benchmarks. Our DeDAHA framework achieves superior results when compared to the state-of-the-art techniques.

Fast Scalable Supervised Hashing

Despite significant progress in supervised hashing, there are three common limitations of existing methods. First, most pioneer methods discretely learn hash codes bit by bit, making the learning procedure rather time-consuming. Second, to reduce the large complexity of the n by n pairwise similarity matrix, most methods apply sampling strategies during training, which inevitably results in information loss and suboptimal performance; some recent methods try to replace the large matrix with a smaller one, but the size is still large. Third, among the methods that leverage the pairwise similarity matrix, most of them only encode the semantic label information in learning the hash codes, failing to fully capture the characteristics of data. In this paper, we present a novel supervised hashing method, called Fast Scalable Supervised Hashing (FSSH), which circumvents the use of the large similarity matrix by introducing a pre-computed intermediate term whose size is independent with the size of training data. Moreover, FSSH can learn the hash codes with not only the semantic information but also the features of data. Extensive experiments on three widely used datasets demonstrate its superiority over several state-of-the-art methods in both accuracy and scalability. Our experiment codes are available at: https://lcbwlx.wixsite.com/fssh.

SESSION: Session 6C: Knowledge Bases/Graphs

Enriching Taxonomies With Functional Domain Knowledge

The rising need to harvest domain specific knowledge in several applications is largely limited by the ability to dynamically grow structured knowledge representations, due to the increasing emergence of new concepts and their semantic relationships with existing ones. Such enrichment of existing hierarchical knowledge sources with new information to better model the "changing world" presents two-fold challenges: (1) Detection of previously unknown entities or concepts, and (2) Insertion of the new concepts into the knowledge structure, respecting the semantic integrity of the created relationships. To this end we propose a novel framework, ETF, to enrich large-scale, generic taxonomies with new concepts from resources such as news and research publications. Our approach learns a high-dimensional embedding for the existing concepts of the taxonomy, as well as for the new concepts. During the insertion of a new concept, this embedding is used to identify semantically similar neighborhoods within the existing taxonomy. The potential parent-child relationships linking the new concepts to the existing ones are then predicted using a set of semantic and graph features. Extensive evaluation of ETF on large, real-world taxonomies of Wikipedia and WordNet showcase more than 5% F1-score improvements compared to state-of-the-art baselines. We further demonstrate that ETF can accurately categorize newly emerging concepts and question-answer pairs across different domains.

On Link Prediction in Knowledge Bases: Max-K Criterion and Prediction Protocols

Building knowledge base embedding models for link prediction has achieved great success. We however argue that the conventional top-k criterion used for evaluating the model performance is inappropriate. This paper introduces a new criterion, referred to as max-k. Through theoretical analysis and experimental study, we show that the top-k criterion is fundamentally inferior to max-k. We also introduce two prediction protocols for the max-k criterion. These protocols are strongly justified theoretically. Various insights concerning the max-k criterion and the two protocols are obtained through extensive experiments.

Weakly-supervised Contextualization of Knowledge Graph Facts

Knowledge graphs (KGs) model facts about the world; they consist of nodes (entities such as companies and people) that are connected by edges (relations such as founderOf ). Facts encoded in KGs are frequently used by search applications to augment result pages. When presenting a KG fact to the user, providing other facts that are pertinent to that main fact can enrich the user experience and support exploratory information needs. \em KG fact contextualization is the task of augmenting a given KG fact with additional and useful KG facts. The task is challenging because of the large size of KGs; discovering other relevant facts even in a small neighborhood of the given fact results in an enormous amount of candidates. We introduce a neural fact contextualization method (\em NFCM ) to address the KG fact contextualization task. NFCM first generates a set of candidate facts in the neighborhood of a given fact and then ranks the candidate facts using a supervised learning to rank model. The ranking model combines features that we automatically learn from data and that represent the query-candidate facts with a set of hand-crafted features we devised or adjusted for this task. In order to obtain the annotations required to train the learning to rank model at scale, we generate training data automatically using distant supervision on a large entity-tagged text corpus. We show that ranking functions learned on this data are effective at contextualizing KG facts. Evaluation using human assessors shows that it significantly outperforms several competitive baselines.

SESSION: Session 6D: Mobile User Behavior

Constructing Click Models for Mobile Search

Users' click-through behavior is considered as a valuable yet noisy source of implicit relevance feedback for web search engines. A series of click models have therefore been proposed to extract accurate and unbiased relevance feedback from click logs. Previous works have shown that users' search behaviors in mobile and desktop scenarios are rather different in many aspects, therefore, the click models that were designed for desktop search may not be as effective in mobile context. To address this problem, we propose a novel Mobile Click Model (MCM) that models how users examine and click search results on mobile SERPs. Specifically, we incorporate two biases that are prevalent in mobile search into existing click models: 1) the click necessity bias that some results can bring utility and usefulness to users without being clicked; 2) the examination satisfaction bias that a user may feel satisfied and stop searching after examining a result with low click necessity. Extensive experiments on large-scale real mobile search logs show that: 1) MCM outperforms existing models in predicting users' click behavior in mobile search; 2) MCM can extract richer information, such as the click necessity of search results and the probability of user satisfaction, from mobile click logs. With this information, we can estimate the quality of different vertical results and improve the ranking of heterogeneous results in mobile search.

Update Delivery Mechanisms for Prospective Information Needs: An Analysis of Attention in Mobile Users

Real-time summarization systems that monitor document streams to identify relevant content have a few options for delivering system updates to users. In a mobile context, systems could send push notifications to users' mobile devices, hoping to grab their attention immediately. Alternatively, systems could silently deposit updates into "inboxes" that users can access at their leisure. We refer to these mechanisms as push-based vs. pull-based, and present a two-year contrastive study that attempts to understand the effects of the delivery mechanism on mobile user behavior, in the context of the TREC Real-Time Summarization Tracks. Through a cluster analysis, we are able to identify three distinct and coherent patterns of behavior. As expected, we find that users are likely to ignore push notifications, but for those updates that users do pay attention to, content is consumed within a short amount of time. Interestingly, users bombarded with push notifications are less likely to consume updates on their own initiative and less likely to engage in long reading sessions---which is a common pattern for users who pull content from their inboxes. We characterize users as exhibiting "eager" or "apathetic" information consumption behavior as an explanation of these observations, and attempt to operationalize our findings into design recommendations.

SESSION: Session 7A: Crowdsourcing & Assessment

Item Retrieval as Utility Estimation

Retrieval systems have greatly improved over the last half century, estimating relevance to a latent user need in a wide variety of areas. One common task in e-commerce and science that has not enjoyed such advancements is searching through a catalog of items. Finding a desirable item in such a catalog requires that the user specify desirable item properties, specifically desirable attribute values. Existing item retrieval systems assume the user can formulate a good Boolean or SQL-style query to retrieve items, as one would do with a database, but this is often challenging, particularly given multiple numeric attributes. Such systems avoid inferring query intent, instead requiring the user to precisely specify what matches the query. A contrasting approach would be to estimate how well items match the user's latent desires and return items ranked by this estimation. Towards this end, we present a retrieval model inspired by multi-criteria decision making theory, concentrating on numeric attributes. In two user studies (choosing airline tickets and meal plans) using Amazon Mechanical Turk, we evaluate our novel approach against the de facto standard of Boolean retrieval and several models proposed in the literature. We use a novel competitive game to motivate test subjects and compare methods based on the results of the subjects' initial query and their success in the game. In our experiments, our new method significantly outperformed the others, whereas the Boolean approaches had the worst performance.

Crowd vs. Expert: What Can Relevance Judgment Rationales Teach Us About Assessor Disagreement?

While crowdsourcing offers a low-cost, scalable way to collect relevance judgments, lack of transparency with remote crowd work has limited understanding about the quality of collected judgments. In prior work, we showed a variety of benefits from asking crowd workers to provide \em rationales for each relevance judgment \citemcdonnell2016relevant. In this work, we scale up our rationale-based judging design to assess its reliability on the 2014 TREC Web Track, collecting roughly 25K crowd judgments for 5K document-topic pairs. We also study having crowd judges perform topic-focused judging, rather than across topics, finding this improves quality. Overall, we show that crowd judgments can be used to reliably rank IR systems for evaluation. We further explore the potential of rationales to shed new light on reasons for judging disagreement between experts and crowd workers. Our qualitative and quantitative analysis distinguishes subjective vs.\ objective forms of disagreement, as well as the relative importance of each disagreement cause, and we present a new taxonomy for organizing the different types of disagreement we observe. We show that many crowd disagreements seem valid and plausible, with disagreement in many cases due to judging errors by the original TREC assessors. We also share our WebCrowd25k dataset, including: (1) crowd judgments with rationales, and (2) taxonomy category labels for each judging disagreement analyzed.

SESSION: Session 7B: Content & Semantics

CAN: Enhancing Sentence Similarity Modeling with Collaborative and Adversarial Network

The neural networks have attracted great attention for sentence similarity modeling in recent years. Most neural networks focus on the representation of each sentence, while the common features of a sentence pair are not well studied. In this paper, we propose a Collaborative and Adversarial Network (CAN), which explicitly models the common features between two sentences for enhancing sentence similarity modeling. To be specific, a common feature extractor is presented and embedded into our CAN model, which includes a generator and a discriminator playing a collaborative and adversarial game for common feature extraction. Experiments on three benchmark datasets, namely TREC-QA and WikiQA for answer selection and MSRP for paraphrase identification, show that our proposed model is effective to boost the performance of sentence similarity modeling. In particular, our proposed model outperforms the state-of-the-art approaches on TREC-QA without using any external resources or pre-training. For the other two datasets, our model is also comparable to if not better than the recent neural network approaches.

Identify Shifts of Word Semantics through Bayesian Surprise

Much work has been done recently on learning word embeddings from large corpora, which attempts to find the coordinates of words in a static and high dimensional semantic space. In reality, such corpora often span a sufficiently long time period, during which the meanings of many words may have changed. The co-evolution of word meanings may also result in a distortion of the semantic space, making these static embeddings unable to accurately represent the dynamics of semantics. In this paper, we present a novel computational method to capture such changes and to model the evolution of word semantics. Distinct from existing approaches that learn word embeddings independently from time periods and then align them, our method explicitly establishes the stable topological structure of word semantics and identifies the surprising changes in the semantic space over time through a principled statistical method. Empirical experiments on large-scale real-world corpora demonstrate the effectiveness of the proposed approach, which outperforms the state-of-the-art by a large margin.

From Royals to Vegans: Characterizing Question Trolling on a Community Question Answering Website

The phenomenon of trolling has emerged as a widespread form of abuse on news sites, online social networks, and other types of social media. In this paper, we study a particular type of trolling, performed by asking a provocative question on a community question-answering website. By combining user reports with subsequent moderator deletions, we identify a set of over 400,000 troll questions on Yahoo Answers, i.e., questions aimed to inflame, upset, and draw attention from others on the community. This set of troll questions spans a lengthy period of time and a diverse set of topical categories. Our analysis reveals unique characteristics of troll questions when compared to "regular" questions, with regards to their metadata, text, and askers. A classifier built upon these features reaches an accuracy of 85% over a balanced dataset. The answers' text and metadata, reflecting the community's response to the question, are found particularly productive for the classification task.

SESSION: Session 7C: Interfaces

Ranking for Relevance and Display Preferences in Complex Presentation Layouts

Learning to Rank has traditionally considered settings where given the relevance information of objects, the desired order in which to rank the objects is clear. However, with today's large variety of users and layouts this is not always the case. In this paper, we consider so-called complex ranking settings where it is not clear what should be displayed, that is, what the relevant items are, and how they should be displayed, that is, where the most relevant items should be placed. These ranking settings are complex as they involve both traditional ranking and inferring the best display order. Existing learning to rank methods cannot handle such complex ranking settings as they assume that the display order is known beforehand. To address this gap we introduce a novel Deep Reinforcement Learning method that is capable of learning complex rankings, both the layout and the best ranking given the layout, from weak reward signals. Our proposed method does so by selecting documents and positions sequentially, hence it ranks both the documents and positions, which is why we call it the Double Rank Model (DRM). Our experiments show that DRM outperforms all existing methods in complex ranking settings, thus it leads to substantial ranking improvements in cases where the display order is not known a priori.

Natural Language Interfaces with Fine-Grained User Interaction: A Case Study on Web APIs

The rapidly increasing ubiquity of computing puts a great demand on next-generation human-machine interfaces. Natural language interfaces, exemplified by virtual assistants like Apple Siri and Microsoft Cortana, are widely believed to be a promising direction. However, current natural language interfaces provide users with little help in case of incorrect interpretation of user commands. We hypothesize that the support of fine-grained user interaction can greatly improve the usability of natural language interfaces. In the specific setting of natural language interface to web APIs, we conduct a systematic study to verify our hypothesis. To facilitate this study, we propose a novel modular sequence-to-sequence model to create interactive natural language interfaces. By decomposing the complex prediction process of a typical sequence-to-sequence model into small, highly-specialized prediction units called modules, it becomes straightforward to explain the model prediction to the user, and solicit user feedback to correct possible prediction errors at a fine-grained level. We test our hypothesis by comparing an interactive natural language interface with its non-interactive version through both simulation and human subject experiments with real-world APIs. We show that with the interactive natural language interface, users can achieve a higher success rate and a lower task completion time, which lead to greatly improved user satisfaction.

SESSION: Short Research Papers I

Pytrec_eval: An Extremely Fast Python Interface to trec_eval

We introduce pytrec_eval, a Python interface to the trec_eval information retrieval evaluation toolkit. pytrec_eval exposes the reference implementations of trec_eval within Python as a native extension. We show that pytrec_eval is around one order of magnitude faster than invoking trec_eval as a sub process from within Python. Compared to a native Python implementation of NDCG, pytrec_eval is twice as fast for practically-sized rankings. Finally, we demonstrate its effectiveness in an application where pytrec_eval is combined with Pyndri and the OpenAI Gym where query expansion is learned using Q-learning.

Split-Lists and Initial Thresholds for WAND-based Search

We examine search engine performance for rank-safe query execution using the WAND and state-of-the-art BMW algorithms. Supported by extensive experiments, we suggest two approaches to improve query performance: initial list thresholds should be used when k values are large, and our split-list WAND approach should be used instead of the normal WAND or BMW approaches. We also recommend that reranking-based distributed systems use smaller k values when selecting the results to return from each partition.

Ontology Evaluation with Path-based Text-aware Entropy Computation

With the rising importance of knowledge exchange, ontologies have become a key technology in the development of shared knowledge models for semantic-driven applications, such as knowledge interchange and semantic integration. Significant progress has been made in the use of entropy to measure the predictability and redundancy of knowledge bases, particularly ontologies. However, the current entropy applications used to evaluate ontologies consider only single-point connectivity rather than path connectivity, assign equal weights to each entity and path, and assume that vertices are static. To address these deficiencies, the present study proposes a Path-based Text-aware Entropy Computation method, PTEC, by considering the path information between different vertices and the textual information within the path to calculate the connectivity path of the whole network and the different weights between various nodes. Information obtained from structure-based embedding and text-based embedding is multiplied by the connectivity matrix of the entropy computation. An experimental evaluation of three real-world ontologies is performed based on ontology statistical information (data quantity), entropy evaluation (data quality), and a case study (ontology structure and text visualization). These aspects mutually demonstrate the reliability of our method. Experimental results demonstrate that PTEC can effectively evaluate ontologies, particularly those in the medical field.

Reducing Variance in Gradient Bandit Algorithm using Antithetic Variates Method

Policy gradient, which makes use of Monte Carlo method to get an unbiased estimation of the parameter gradients, has been widely used in reinforcement learning. One key issue in policy gradient is reducing the variance of the estimation. From the viewpoint of statistics, policy gradient with baseline, a successful variance reduction method for policy gradient, directly applies the control variates method, a traditional variance reduction technique used in Monte Carlo, to policy gradient. One problem with control variates method is that the quality of estimation heavily depends on the choice of the control variates. To address the issue and inspired by the antithetic variates method for variance reduction, we propose to combine the antithetic variates method with traditional policy gradient for the multi-armed bandit problem. Furthermore, we achieve a new policy gradient algorithm called Antithetic-Arm Bandit (AAB). In AAB, the gradient is estimated through coordinate ascent where at each iteration gradient of the target arm is estimated through: 1) constructing a sequence of arms which is approximately monotonic in terms of estimated gradients, 2) sampling a pair of antithetic arms over the sequence, and 3) re-estimating the target gradient based on the sampled pair. Theoretical analysis proved that AAB achieved an unbiased and variance reduced estimation. Experimental results based on a multi-armed bandit task showed that AAB can achieve state-of-the-art performances.

A Flexible Forecasting Framework for Hierarchical Time Series with Seasonal Patterns: A Case Study of Web Traffic

In this work, we focus on models and analysis of multivariate time series data that are organized in hierarchies. Such time series are referred to as hierarchical time series (HTS) and they are very common in business, management, energy consumption, social networks, or web traffic modeling and analysis. We propose a new flexible hierarchical forecasting framework, that takes advantage of the hierarchical relational structure to predict individual time series. Our new forecasting framework is able to (1) handle HTS modeling and forecasting problems; (2) make accurate forecasting for HTS with seasonal patterns; (3) incorporate various individual forecasting models and combine heuristics based on the HTS datasets' own characterization. The proposed framework is evaluated on a real-world web traffic data set. The results demonstrate that our approach is superior when applied to hierarchical web traffic prediction problems, and it outperforms alternative time series prediction models in terms of accuracy

Query Performance Prediction using Passage Information

We focus on the post-retrieval query performance prediction (QPP) task. Specifically, we make a new use of passage information for this task. Using such information we derive a new mean score calibration predictor that provides a more accurate prediction. Using an empirical evaluation over several common TREC benchmarks, we show that, QPP methods that only make use of document-level features are mostly suited for short query prediction tasks; while such methods perform significantly worse in verbose query prediction settings. We further demonstrate that, QPP methods that utilize passage-information are much better suited for verbose settings. Moreover, our proposed predictor, which utilizes both document-level and passage-level features provides a more accurate and consistent prediction for both types of queries. Finally, we show a connection between our predictor and a recently proposed supervised QPP method, which results in an enhanced prediction.

Users, Adaptivity, and Bad Abandonment

We consider two recent proposals for effectiveness metrics that have been argued to be adaptive, those of Moffat et al. (ACM TOIS, 2017) and Jiang and Allan (CIKM, 2017), and consider the user interaction models that they give rise to. By categorizing non-relevant documents into those that are plausibly non-relevant and those that are egregiously non-relevant, we capture all of the attributes incorporated into the two proposals, and hence develop an effectiveness metric that better reflects user behavior when viewing the SERP, including bad abandonment.

Knowledge-aware Attentive Neural Network for Ranking Question Answer Pairs

Ranking question answer pairs has attracted increasing attention recently due to its broad applications such as information retrieval and question answering (QA). Significant progresses have been made by deep neural networks. However, background information and hidden relations beyond the context, which play crucial roles in human text comprehension, have received little attention in recent deep neural networks that achieve the state of the art in ranking QA pairs. In the paper, we propose KABLSTM, a Knowledge-aware Attentive Bidirectional Long Short-Term Memory, which leverages external knowledge from knowledge graphs (KG) to enrich the representational learning of QA sentences. Specifically, we develop a context-knowledge interactive learning architecture, in which a context-guided attentive convolutional neural network (CNN) is designed to integrate knowledge embeddings into sentence representations. Besides, a knowledge-aware attention mechanism is presented to attend interrelations between each segments of QA pairs. KABLSTM is evaluated on two widely-used benchmark QA datasets: WikiQA and TREC QA. Experiment results demonstrate that KABLSTM has robust superiority over competitors and sets state-of-the-art.

A Living Lab Study of Query Amendment in Job Search

Errors in formulation of queries made by users can lead to poor search results pages. We performed a living lab study using online A/B testing to measure the degree of improvement achieved with a query amendment technique when applied to a commercial job search engine. Of particular interest in this case study is a clear 'success' signal, namely, the number of job applications lodged by a user as a result of querying the service. A set of 276 queries was identified for amendment in four different categories through the use of word embeddings, with large gains in conversion rates being attained in all four of those categories. Our analysis of query reformulations also provides a better understanding of user satisfaction in the case of problematic queries (ones with fewer results than fill a single page) by observing that users tend to reformulate rewritten queries less.

Attention-driven Factor Model for Explainable Personalized Recommendation

Latent Factor Models (LFMs) based on Collaborative Filtering (CF) have been widely applied in many recommendation systems, due to their good performance of prediction accuracy. In addition to users' ratings, auxiliary information such as item features is often used to improve performance, especially when ratings are very sparse. To the best of our knowledge, most existing LFMs integrate different item features in the same way for all users. Nevertheless, the attention on different item attributes varies a lot from user to user. For personalized recommendation, it is valuable to know what feature of an item a user cares most about. Besides, the latent vectors used to represent users or items in LFMs have few explicit meanings, which makes it difficult to explain why an item is recommended to a specific user. In this work, we propose the Attention-driven Factor Model (AFM), which can not only integrate item features driven by users' attention but also help answer this "why". To estimate users' attention distributions on different item features, we propose the Gated Attention Units (GAUs) for AFM. The GAUs make it possible to let the latent factors "talk", by generating user attention distributions from user latent vectors. With users' attention distributions, we can tune the weights of item features for different users. Moreover, users' attention distributions can also serve as explanations for our recommendations. Experiments on several real-world datasets demonstrate the advantages of AFM (using GAUs) over competitive baseline algorithms on rating prediction.

IRevalOO: An Object Oriented Framework for Retrieval Evaluation

We propose IRevalOO, a flexible Object Oriented framework that (i) can be used as-is as a replacement of the widely adopted trec\_eval software, and (ii) can be easily extended (or "instantiated'', in framework terminology) to implement different scenarios of test collection based retrieval evaluation. Instances of IRevalOO can provide a usable and convenient alternative to the state-of-the-art software commonly used by different initiatives (TREC, NTCIR, CLEF, FIRE, etc.). Also, those instances can be easily adapted to satisfy future customization needs of researchers, as: implementing and experimenting with new metrics, even based on new notions of relevance; using different formats for system output and "qrels''; and in general visualizing, comparing, and managing retrieval evaluation results.

Translating Representations of Knowledge Graphs with Neighbors

Knowledge graph completion is a critical issue because many applications benefit from their structural and rich resources. In this paper, we propose a method named TransN, which consid- ers the dependencies between triples and incorporates neighbor information dynamically. In experiments, we evaluate our model by link prediction and also conduct several qualitative analyses to prove effectiveness. Experimental results show that our model could integrate neighbor information effectively and outperform state-of-the-art models.

Index Compression for BitFunnel Query Processing

Large-scale search engines utilize inverted indexes which store ordered lists of document identifies (docIDs) relevant to query terms, which can be queried thousands of times per second. In order to reduce storage requirements, we propose a dictionary-based compression approach for the recently proposed bitwise data-structure BitFunnel, which makes use of a Bloom filter. Compression is achieved through storing frequently occurring blocks in a dictionary. Infrequently occurring blocks (those which are not represented in the dictionary) are instead referenced using similar blocks that are in the dictionary, introducing additional false positive errors. We further introduce a docID reordering strategy to improve compression. Experimental results indicate an improvement in compression by 27% to 30%, at the expense of increasing the query processing time by 16% to 48% and increasing the false positive rate by around 7.6 to 10.7 percentage points.

Large-Scale Image Retrieval with Elasticsearch

Content-Based Image Retrieval in large archives through the use of visual features has become a very attractive research topic in recent years. The cause of this strong impulse in this area of research is certainly to be attributed to the use of Convolutional Neural Network (CNN) activations as features and their outstanding performance. However, practically all the available image retrieval systems are implemented in main memory, limiting their applicability and preventing their usage in big-data applications. In this paper, we propose to transform CNN features into textual representations and index them with the well-known full-text retrieval engine Elasticsearch. We validate our approach on a novel CNN feature, namely Regional Maximum Activations of Convolutions. A preliminary experimental evaluation, conducted on the standard benchmark INRIA Holidays, shows the effectiveness and efficiency of the proposed approach and how it compares to state-of-the-art main-memory indexes.

A Co-Memory Network for Multimodal Sentiment Analysis

With the rapid increase of diversity and modality of data in user-generated contents, sentiment analysis as a core area of social media analytics has gone beyond traditional text-based analysis. Multimodal sentiment analysis has become an important research topic in recent years. Most of the existing work on multimodal sentiment analysis extracts features from image and text separately, and directly combine them to train a classifier. As visual and textual information in multimodal data can mutually reinforce and complement each other in analyzing the sentiment of people, previous research all ignores this mutual influence between image and text. To fill this gap, in this paper, we consider the interrelation of visual and textual information, and propose a novel co-memory network to iteratively model the interactions between visual contents and textual words for multimodal sentiment analysis. Experimental results on two public multimodal sentiment datasets demonstrate the effectiveness of our proposed model compared to the state-of-the-art methods.

Investigating User Perception of Gender Bias in Image Search: The Role of Sexism

There is growing evidence that search engines produce results that are socially biased, reinforcing a view of the world that aligns with prevalent social stereotypes. One means to promote greater transparency of search algorithms - which are typically complex and proprietary - is to raise user awareness of biased result sets. However, to date, little is known concerning how users perceive bias in search results, and the degree to which their perceptions differ and/or might be predicted based on user attributes. One particular area of search that has recently gained attention, and forms the focus of this study, is image retrieval and gender bias. We conduct a controlled experiment via crowdsourcing using participants recruited from three countries to measure the extent to which workers perceive a given image results set to be subjective or objective. Demographic information about the workers, along with measures of sexism, are gathered and analysed to investigate whether (gender) biases in the image search results can be detected. Amongst other findings, the results confirm that sexist people are less likely to detect and report gender biases in image search results.

Killing Two Birds With One Stone: Concurrent Ranking of Tags and Comments of Social Images

User-generated comments and tags can reveal important visual concepts associated with an image in Flickr. However, due to the inherent noisiness of the metadata, not all user tags are necessarily descriptive of the image. Likewise, comments may contain spam or chatter that are irrelevant to the image. Hence, identifying and ranking relevant tags and comments can boost applications such as tag-based image search, tag recommendation, etc. In this paper, we present a lightweight visual signature-based model to concurrently generate ranked lists of comments and tags of a social image based on their joint relevance to the visual features, user comments, and user tags. The proposed model is based on sparse reconstruction of the visual content of an image using its tags and comments. Through empirical study on Flickr dataset, we demonstrate the effectiveness and superiority of the proposed technique against state-of-the-art tag ranking and refinement techniques.

Imagination Based Sample Construction for Zero-Shot Learning

Zero-shot learning (ZSL) which aims to recognize unseen classes with no labeled training sample, efficiently tackles the problem of missing labeled data in image retrieval. Nowadays there are mainly two types of popular methods for ZSL to recognize images of unseen classes: probabilistic reasoning and feature projection. Different from these existing types of methods, we propose a new method: sample construction to deal with the problem of ZSL. Our proposed method, called Imagination Based Sample Construction (IBSC), innovatively constructs image samples of target classes in feature space by mimicking human associative cognition process. Based on an association between attribute and feature, target samples are constructed from different parts of various samples. Furthermore, dissimilarity representation is employed to select high-quality constructed samples which are used as labeled data to train a specific classifier for those unseen classes. In this way, zero-shot learning is turned into a supervised learning problem. As far as we know, it is the first work to construct samples for ZSL thus, our work is viewed as a baseline for future sample construction methods. Experiments on four benchmark datasets show the superiority of our proposed method.

Parameterizing Kterm Hashing

Kterm Hashing provides an innovative approach to novelty detection on massive data streams. Previous research focused on maximizing the efficiency of Kterm Hashing and succeeded in scaling First Story Detection to Twitter-size data stream without sacrificing detection accuracy. In this paper, we focus on improving the effectiveness of Kterm Hashing. Traditionally, all kterms are considered as equally important when calculating a document's degree of novelty with respect to the past. We believe that certain kterms are more important than others and hypothesize that uniform kterm weights are sub-optimal for determining novelty in data streams. To validate our hypothesis, we parameterize Kterm Hashing by assigning weights to kterms based on their characteristics. Our experiments apply Kterm Hashing in a First Story Detection setting and reveal that parameterized Kterm Hashing can surpass state-of-the-art detection accuracy and significantly outperform the uniformly weighted approach.

Technology Assisted Reviews: Finding the Last Few Relevant Documents by Asking Yes/No Questions to Reviewers

The goal of a technology-assisted review is to achieve high recall with low human effort. Continuous active learning algorithms have demonstrated good performance in locating the majority of relevant documents in a collection, however their performance is reaching a plateau when 80%-90% of them has been found. Finding the last few relevant documents typically requires exhaustively reviewing the collection. In this paper, we propose a novel method to identify these last few, but significant, documents efficiently. Our method makes the hypothesis that entities carry vital information in documents, and that reviewers can answer questions about the presence or absence of an entity in the missing relevance documents. Based on this we devise a sequential Bayesian search method that selects the optimal sequence of questions to ask. The experimental results show that our proposed method can greatly improve performance requiring less reviewing effort.

An Information Retrieval Framework for Contextual Suggestion Based on Heterogeneous Information Network Embeddings

We present an Information Retrieval framework that leverages Heterogeneous Information Network (HIN) embeddings for contextual suggestion. Our method represents users, documents and other context-related documents as heterogeneous objects in a HIN. Using meta-paths, selected based on domain knowledge, we create graph embeddings from this network, thereby learning a representation of users and objects in the same semantic vector space. This allows inferences of user interest on unseen objects based on distance in the embedding space. These object distances are then incorporated as features in a well-established learning to rank (LTR) framework. We make use of the 2016 TREC Contextual Suggestion (TRECCS) dataset, which contains user profiles in the form of relevance-rated documents, and demonstrate the competitiveness of our approach by comparing our system to the best performing systems of the TRECCS task.

Toward an Interactive Patent Retrieval Framework based on Distributed Representations

We present a novel interactive framework for patent retrieval leveraging distributed representations of concepts and entities extracted from the patents text. We propose a simple and practical interactive relevance feedback mechanism where the user is asked to annotate relevant/irrelevant results from the top n hits. We then utilize this feedback for query reformulation and term weighting where weights are assigned based on how good each term is at discriminating the relevant vs. irrelevant candidates. First, we demonstrate the efficacy of the distributed representations on the CLEF-IP 2010 dataset where we achieve significant improvement of 4.6% in recall over the keyword search baseline. Second, we simulate interactivity to demonstrate the efficacy of our interactive term weighting mechanism. Simulation results show that we can achieve significant improvement in recall from one interaction iteration outperforming previous semantic and interactive patent retrieval methods.

Consistency and Variation in Kernel Neural Ranking Model

This paper studies the consistency of the kernel-based neural ranking model K-NRM, a recent state-of-the-art neural IR model, which is important for reproducible research and deployment in the industry. We find that K-NRM has low variance on relevance-based metrics across experimental trials. In spite of this low variance in overall performance, different trials produce different document rankings for individual queries. The main source of variance in our experiments was found to be different latent matching patterns captured by K-NRM. In the IR-customized word embeddings learned by K-NRM, the query-document word pairs follow two different matching patterns that are equally effective, but align word pairs differently in the embedding space. The different latent matching patterns enable a simple yet effective approach to construct ensemble rankers, which improve K-NRM's effectiveness and generalization abilities.

Review Sentiment-Guided Scalable Deep Recommender System

Existing review-aware recommendation methods represent users (or items) through the concatenation of the reviews written by (or for) them, and depend entirely on convolutional neural networks (CNNs) to extract meaningful features for modeling users (or items). However, understanding reviews based only on the raw words of reviews is challenging because of the inherent ambiguity contained in them originated from the users' different tendency in writing. Moreover, it is inefficient in time and memory to model users/items by the concatenation of their associated reviews owing to considerably large inputs to CNNs. In this work, we present a scalable review-aware recommendation method, called SentiRec, that is guided to incorporate the sentiments of reviews when modeling the users and the items. SentiRec is a two-step approach composed of the first step that includes the encoding of each review into a fixed-size review vector that is trained to embody the sentiment of the review, followed by the second step that generates recommendations based on the vector-encoded reviews. Through our experiments, we show that SentiRec not only outperforms the existing review-aware methods, but also drastically reduces the training time and the memory usage. We also conduct a qualitative evaluation on the vector-encoded reviews trained by SentiRec to demonstrate that the overall sentiments are indeed encoded therein.

Learning to Detect Pathogenic Microorganism of Community-acquired Pneumonia

Community-acquired pneumonia (CAP) is a major death cause for children, requiring an early administration of appropriate antibiotics to cure it. To achieve this, accurate detection of pathogenic microorganism is crucial, especially for reducing the abuse of antibiotics. Conventional gold standard detection methods are mainly etiology based, incurring high cost and labor intensity. Although recently electronic health records (EHRs) become prevalent and widely used, their power for automatically determining pathogenic microorganism has not been investigated. In this paper, we formulate a new problem for automatically detecting pathogenic microorganism of CAP by considering patient biomedical features from EHRs, including time-varying body temperatures and common laboratory measurements. We further develop a Patient Attention based Recurrent Neural Network (PA-RNN) model to fuse different patient features for detection. We conduct experiments on a real dataset, demonstrating utilizing electronic health records yields promising performance and PA-RNN outperforms several alternatives.

Towards Distributed Pairwise Ranking using Implicit Feedback

Learning with pairwise ranking methods for implicit feedback datasets has shown promising results as compared to pointwise ranking methods for recommendation tasks. However, there is limited effort in scaling the pairwise ranking methods in a large scale distributed setting. In this paper we address the scalability aspect of a pairwise ranking method using Factorization Machines in distributed settings. Our proposed method is based on a block partitioning of the model parameters so that each distributed worker runs stochastic gradient updates on an independent block. We developed a dynamic block creation and exchange strategy by utilizing the frequency of occurrence of a feature in the local training data of a worker. Empirical evidence on publicly available benchmark datasets indicates that the proposed method scales better than the static block based methods and outperforms competing state-of-the-art methods.

Semantic Location in Email Query Suggestion

Mobile devices are pervasive, which means that users have access to web content and their personal documents at all locations, not just their home or office. Existing work has studied how locations can influence information needs, focusing on web queries. We explore whether or not location information can be helpful to users who are searching their own personal documents. We wish to study whether a users' location can predict their queries over their own personal data, so we focus on the task of query suggestion. While we find that using location directly can be helpful, it does not generalize well to novel locations. To improve this situation, we explore using semantic location: that is, rather than memorizing location-query associations, we generalize our location information to names of the closest point of interest. By using short, semantic descriptions of locations, we find that we can more robustly improve query completion and observe that users are already using locations to extend their own queries in this domain. We present a simple but effective model that can use location to predict queries for a user even before they type anything into a search box, and which learns effectively even when not all queries have location information.

GraphCAR: Content-aware Multimedia Recommendation with Graph Autoencoder

Precisely recommending relevant multimedia items from massive candidates to a large number of users is an indispensable yet difficult task on many platforms. A promising way is to project users and items into a latent space and recommend items via the inner product of latent factor vectors. However, previous studies paid little attention to the multimedia content itself and couldn't make the best use of preference data like implicit feedback. To fill this gap, we propose a Content-aware Multimedia Recommendation Model with Graph Autoencoder (GraphCAR), combining informative multimedia content with user-item interaction. Specifically, user-item interaction, user attributes and multimedia contents (e.g., images, videos, audios, etc.) are taken as input of the autoencoder to generate the item preference scores for each user. Through extensive experiments on two real-world multimedia Web services: Amazon and Vine, we show that GraphCAR significantly outperforms state-of-the-art techniques of both collaborative filtering and content-based methods.

Multi-level Abstraction Convolutional Model with Weak Supervision for Information Retrieval

Recent neural models for IR have produced good retrieval effectiveness compared with traditional models. Yet all of them assume that a single matching function should be used for all queries. In practice, user's queries may be of various nature which might require different levels of matching, from low level word matching to high level conceptual matching. To cope with this problem, we propose a multi-level abstraction convolutional model (MACM) that generates and aggregates several levels of matching scores. Weak supervision is used to address the problem of large training data. Experimental results demonstrated the effectiveness of our proposed MACM model.

Analyzing and Characterizing User Intent in Information-seeking Conversations

Understanding and characterizing how people interact in information-seeking conversations is crucial in developing conversational search systems. In this paper, we introduce a new dataset designed for this purpose and use it to analyze information-seeking conversations by user intent distribution, co-occurrence, and flow patterns. The MSDialog dataset is a labeled dialog dataset of question answering (QA) interactions between information seekers and providers from an online forum on Microsoft products. The dataset contains more than 2,000 multi-turn QA dialogs with 10,000 utterances that are annotated with user intent on the utterance level. Annotations were done using crowdsourcing. With MSDialog, we find some highly recurring patterns in user intent during an information-seeking process. They could be useful for designing conversational search systems. We will make our dataset freely available to encourage exploration of information-seeking conversation models.

Modeling Multidimensional User Relevance in IR using Vector Spaces

It has been shown that relevance judgment of documents is influenced by multiple factors beyond topicality. Some multidimensional user relevance models (MURM) proposed in literature have investigated the impact of different dimensions of relevance on user judgment. Our hypothesis is that a user might give more importance to certain relevance dimensions in a session which might change dynamically as the session progresses. This motivates the need to capture the weights of different relevance dimensions using feedback and build a model to rank documents for subsequent queries according to these weights. We propose a geometric model inspired by the mathematical framework of Quantum theory to capture the user's importance given to each dimension of relevance and test our hypothesis on data from a web search engine and TREC Session track

Are we on the Right Track?: An Examination of Information Retrieval Methodologies

The unpredictability of user behavior and the need for effectiveness make it difficult to define a suitable research methodology for Information Retrieval (IR). In order to tackle this challenge, we categorize existing IR methodologies along two dimensions: (1) empirical vs. theoretical, and (2) top-down vs. bottom-up. The strengths and drawbacks of the resulting categories are characterized according to 6 desirable aspects. The analysis suggests that different methodologies are complementary and therefore, equally necessary. The categorization of the 167 full papers published in the last SIGIR (2016 and 2017) and ICTIR (2017) conferences suggest that most of existing work is empirical bottom-up, suggesting lack of some desirable aspects. With the hope of improving IR research practice, we propose a general methodology for IR that integrates the strengths of existing research methods.

Fine-Grained Information Identification in Health Related Posts

Online health communities have become a medium for patients to share their personal experiences and interact with peers on topics related to a disease, medication, side effects, and therapeutic processes. Analyzing informational posts in these communities can provide an insightful view about the dominant health issues and can help patients find the information that they need easier. In this paper, we propose a computational model that mines user content in online health communities to detect positive experiences and suggestions on health improvement as well as negative impacts or side effects that cause suffering throughout fighting with a disease. Specifically, we combine high-level, abstract features extracted from a convolutional neural network with lexicon-based features and features extracted from a long short term memory network to capture the semantics in the data. We show that our model, with and without lexicon-based features, outperforms strong baselines.

Quantitative Information Extraction From Social Data

Social data is a rich data source for identifying trends and topics of interest based on user activity. Social data also provides opportunities to collect numerical data about events like elections, sport games, disasters or economic news. We propose the problem of identifying relevant quantitative information from social data as annotations for a topic. We investigate how to extract quantitative information and perform a number of experiments and analysis with Twitter data.

Measuring Influence on Instagram: A Network-Oblivious Approach

This paper focuses on the problem of scoring and ranking influential users of Instagram, a visual content sharing online social network (OSN). Instagram is the second largest OSN in the world with 700 million active Instagram accounts, 32% of all worldwide Internet users. Among the millions of users, photos shared by more influential users are viewed by more users than posts shared by less influential counterparts. This raises the question of how to identify those influential Instagram users. In our work, we present and discuss the lack of relevant tools and insufficient metrics for influence measurement, focusing on a network oblivious approach and show that the graph-based approach used in other OSNs is a poor fit for Instagram. In our study, we consider user statistics, some of which are more intuitive than others, and several regression models to measure users' influence.

Event2Vec: Neural Embeddings for News Events

Representation of news events as latent feature vectors is essential for several tasks, such as news recommendation, news event linking, etc. However, representations proposed in the past fail to capture the complex network structure of news events. In this paper we propose Event2Vec, a novel way to learn latent feature vectors for news events using a network. We use recently proposed network embedding techniques, which are proven to be very effective for various prediction tasks in networks. As events involve different classes of nodes, such as named entities, temporal information, etc, general purpose network embeddings are agnostic to event semantics. To address this problem, we propose biased random walks that are tailored to capture the neighborhoods of news events in event networks. We then show that these learned embeddings are effective for news event recommendation and news event linking tasks using strong baselines, such as vanilla Node2Vec, and other state-of-the-art graph-based event ranking techniques.

Universal Approximation Functions for Fast Learning to Rank: Replacing Expensive Regression Forests with Simple Feed-Forward Networks

Learning to rank is a key component of modern information retrieval systems. Recently, regression forest models (i.e., random forests, LambdaMART and gradient boosted regression trees) have come to dominate learning to rank systems in practice, as they provide the ability to learn from large scale data while generalizing well to additional test queries. As a result, efficient implementations of these models is a concern in production systems, as evidenced by past work. We propose an alternate method for optimizing the execution of learned models: converting these expensive ensembles to a feed-forward neural network. This simple neural architecture is quite efficient to execute: we show that the resulting chain of matrix multiplies is quite efficient while maintaining the effectiveness of the original, more-expensive forest model. Our neural approach has the advantage of being easier to train than any direct neural models, since it can match the previously-learned regression rather than learn to generalize relevance judgments directly. We observe CPU document scoring speed improvements of up to 400x over traditional algorithms and up to 10x over state-of-the-art algorithms with no measurable loss in mean average precision. With a GPU available, our algorithm is able to score every document in a batch in parallel for another 10-100x improvement. While we are not the first work to observe that neural networks are efficient as well as being effective, our application of this observation to learning to rank is novel and will have large real-world impact.

Modeling Mobile User Actions for Purchase Recommendation using Deep Memory Networks

Rapid expansion of mobile devices has brought an unprecedented opportunity for mobile operators and content publishers to reach many users at any point in time. Understanding usage patterns of mobile applications (apps) is an integral task that precedes advertising efforts of providing relevant recommendations to users. However, this task can be very arduous due to the unstructured nature of app data, with sparseness in available information. This study proposes a novel approach to learn representations of mobile user actions using Deep Memory Networks. We validate the proposed approach on millions of app usage sessions built from large scale feeds of mobile app events and mobile purchase receipts. The empirical study demonstrates that the proposed approach performed better compared to several competitive baselines in terms of recommendation precision quality. To the best of our knowledge this is the first study analyzing app usage patterns for purchase recommendation.

Cross Domain Regularization for Neural Ranking Models using Adversarial Learning

Unlike traditional learning to rank models that depend on hand-crafted features, neural representation learning models learn higher level features for the ranking task by training on large datasets. Their ability to learn new features directly from the data, however, may come at a price. Without any special supervision, these models learn relationships that may hold only in the domain from which the training data is sampled, and generalize poorly to domains not observed during training. We study the effectiveness of adversarial learning as a cross domain regularizer in the context of the ranking task. We use an adversarial discriminator and train our neural ranking model on a small set of domains. The discriminator provides a negative feedback signal to discourage the model from learning domain specific representations. Our experiments show consistently better performance on held out domains in the presence of the adversarial discriminator---sometimes up to 30% on [email protected]$.

Affective Representations for Sarcasm Detection

Sarcasm detection from text has gained increasing attention. While one thread of research has emphasized the importance of affective content in sarcasm detection, another avenue of research has explored the effectiveness of word representations. In this paper, we introduce a novel model for automated sarcasm detection in text, called Affective Word Embeddings for Sarcasm (AWES), which incorporates affective information into word representations. Extensive evaluation on sarcasm detection on six datasets across three domains of text (tweets, reviews and forum posts) demonstrates the effectiveness of the proposed model. The experimental results indicate that while sentiment affective representations yield best results on datasets comprising of short length text such as tweets, richer representations derived from fine-grained emotions are more suitable for detecting sarcasm from longer length documents such as product reviews and discussion forum posts.

A User Study on Snippet Generation: Text Reuse vs. Paraphrases

The snippets in the result list of a web search engine are built with sentences from the retrieved web pages that match the query. Reusing a web page's text for snippets has been considered fair use under the copyright laws of most jurisdictions. As of recent, notable exceptions from this arrangement include Germany and Spain, where news publishers are entitled to raise claims under a so-called ancillary copyright. A similar legislation is currently discussed at the European Commission. If this development gains momentum, the reuse of text for snippets will soon incur costs, which in turn will give rise to new solutions for generating truly original snippets. A key question in this regard is whether the users will accept any new approach for snippet generation, or whether they will prefer the current model of "reuse snippets." The paper in hand gives a first answer. A crowdsourcing experiment along with a statistical analysis reveals that our test users exert no significant preference for either kind of snippet. Notwithstanding the technological difficulty, this result opens the door to a new snippet synthesis paradigm.

New Embedded Representations and Evaluation Protocols for Inferring Transitive Relations

Beyond word embeddings, continuous representations of knowledge graph (KG) components, such as entities, types and relations, are widely used for entity mention disambiguation, relation inference and deep question answering. Great strides have been made in modeling general, asymmetric or antisymmetric KG relations using Gaussian, holographic, and complex embeddings. None of these directly enforce transitivity inherent in the is-instance-of and is-subtype-of relations. A recent proposal, called order embedding (OE), demands that the vector representing a subtype elementwise dominates the vector representing a supertype. However, the manner in which such constraints are asserted and evaluated have some limitations. In this short research note, we make three contributions specific to representing and inferring transitive relations. First, we propose and justify a significant improvement to the OE loss objective. Second, we propose a new representation of types as hyper-rectangular regions, that generalize and improve on OE. Third, we show that some current protocols to evaluate transitive relation inference can be misleading, and offer a sound alternative. Rather than use black-box deep learning modules off-the-shelf, we develop our training networks using elementary geometric considerations.

Taxi or Hitchhiking: Predicting Passenger's Preferred Service on Ride Sharing Platforms

Ride sharing apps like Uber and Didi Chuxing have played an important role in addressing the users' transportation needs, which come not only in huge volumes, but also in great variety. While some users prefer low-cost services such as carpooling or hitchhiking, others prefer more pricey options like taxi or premier services. Further analyses suggest that such preference may also be associated with different time and location. In this paper, we empirically analyze the preferred services and propose a recommender system which provides service recommendation based on temporal, spatial, and behavioral features. Offline simulations show that our system achieves a high prediction accuracy and reduces the user's effort in finding the desired service. Such a recommender system allows a more precise scheduling for the platform, and enables personalized promotions.

Towards Intent-Aware Contextual Music Recommendation: Initial Experiments

While activity-aware music recommendation has been shown to improve the listener experience, we posit that modeling the \em listening intent can further improve recommendation quality. In this paper, we perform initial exploration of the dominant music listening intents associated with common activities, using music retrieved from popular online music services. We show that these intents can be approximated through audio features of the music itself, and potentially improve recommendation quality. Our initial results, based on 10 common activities and 5 popular listening intents associated with these activities, support our hypothesis, and open a promising direction towards intent-aware contextual music recommendation.

Deep Domain Adaptation Based on Multi-layer Joint Kernelized Distance

Domain adaptation refers to the learning scenario where a model learned from the source data is applied on the target data which have the same categories but different distributions. In information retrieval, there exist application scenarios like cross domain recommendation characterized similarly. In this paper, by utilizing deep features extracted from the deep networks, we proposed to compute the multi-layer joint kernelized mean distance between the k th target data predicted as the i th category and all the source data of the j th category $d_ij ^k$. Then, target data $T_m$ that are most likely to belong to the i th category can be found by calculating the relative distance $d_ii ^k/\sum_j d_ij ^k$. By iteratively adding $T_m$ to the training data, the finetuned deep model can adapt on the target data progressively. Our results demonstrate that the proposed method can achieve a better performance compared to a number of state-of-the-art methods.

Do Not Pull My Data for Resale: Protecting Data Providers Using Data Retrieval Pattern Analysis

Data providers have a profound contribution to many fields such as finance, economy, and academia by serving people with both web-based and API-based query service of specialized data. Among the data users, there are data resellers who abuse the query APIs to retrieve and resell the data to make a profit, which harms the data provider's interests and causes copyright infringement. In this work, we define the "anti-data-reselling" problem and propose a new systematic method that combines feature engineering and machine learning models to provide a solution. We apply our method to a real query log of over 9,000 users with limited labels provided by a large financial data provider and get reasonable results, insightful observations, and real deployments.

K-plet Recurrent Neural Networks for Sequential Recommendation

Recurrent Neural Networks have been successful in learning meaningful representations from sequence data, such as text and speech. However, recurrent neural networks attempt to model only the overall structure of each sequence independently, which is unsuitable for recommendations. In recommendation system, an optimal model should not only capture the global structure, but also the localized relationships. This poses a great challenge in the application of recurrent neural networks to the sequence prediction problem. To tackle this challenge, we incorporate the neighbor sequences into recurrent neural networks to help detect local relationships. Thus we propose a K -plet R ecurrent Neural Network (Kr Network for short) to accommodate multiple sequences jointly, and then introduce two ways to model their interactions between sequences. Experimental results on benchmark datasets show that our proposed architecture Kr Network outperforms state-of-the-art baseline methods in terms of generalization, short-term and long term prediction accuracy.

Citation Worthiness of Sentences in Scientific Reports

Does this sentence need citation? In this paper, we introduce the task of citation worthiness for scientific texts at a sentence-level granularity. The task is to detect whether a sentence in a scientific article needs to be cited or not. It can be incorporated into citation recommendation systems to help automate the citation process by marking sentences where needed. It may also be useful for publishers to regularize the citation process. We construct a dataset using the ACL Anthology Reference Corpus; consisting of over 1.1M "not_cite" and 85K "cite" sentences. We study the performance of a set of state-of-the-art sentence classifiers for the citation worthiness task and show the practical challenges. We also explore section-wise difficulty of the task and analyze the performance of our best model on a published article.

SESSION: Short Research Papers II

Ad Click Prediction in Sequence with Long Short-Term Memory Networks: an Externality-aware Model

Ad click prediction is a task to estimate the click-through rate (CTR) in sponsored ads, the accuracy of which impacts user search experience and businesses' revenue. State-of-the-art sponsored search systems typically model it as a classification problem and employ machine learning approaches to predict the CTR per ad. In this paper, we propose a new approach to predict ad CTR in sequence which considers user browsing behavior and the impact of top ads quality to the current one. To the best of our knowledge, this is the first attempt in the literature to predict ad CTR by using Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM) cells. The proposed model is evaluated on a real dataset and we show that LSTM-RNN outperforms DNN model on both AUC and RIG. Since the RNN inference is time consuming, a simplified version is also proposed, which can achieve more than half of the gain with the overall serving cost almost unchanged.

Assessing the Readability of Web Search Results for Searchers with Dyslexia

Standards organizations, (e.g., the World Wide Web Consortium), are placing increased importance on the cognitive accessibility of online systems, including web search. Previous work has shown an association between query-document relevance judgments, and query-independent assessments of document readability. In this paper we study the lexical and aesthetic features of web documents that may underlie this relationship. Leveraging a data set consisting of relevance and readability judgments for 200 web pages as assessed by 174 adults with dyslexia and 172 adults without dyslexia, we answer the following research questions: (1) Which web page features are most associated with readability? (2) To what extent are these features also associated with relevance? And, (3) are any features associated with the differences in readability/relevance judgments observed between dyslexic and non-dyslexic populations? Our findings have implications for improving the cognitive accessibility of search systems and web documents.

Comparing Two Binned Probability Distributions for Information Access Evaluation

Some modern information access tasks such as natural language dialogue tasks are difficult to evaluate, for often there is no such thing as the ground truth: different users may have different opinions about the system's output. A few task designs for dialogue evaluation have been implemented and/or proposed recently, where both the ground truth data and the system's output are represented as a distribution of users' votes over bins on a non-nominal scale. The present study first points out that popular bin-by-bin measures such as Jensen-Shannon divergence and Sum of Squared Errors are clearly not adequate for such tasks, and that cross-bin measures should be used. Through experiments using artificial distributions as well as real ones from a dialogue evaluation task, we demonstrate that two cross-bin measures, namely, the Normalised Match Distance (NMD; a special case of the Earth Mover's Distance) and the Root Symmetric Normalised Order-aware Divergence (RSNOD), are indeed substantially different from the bin-by-bin measures.Furthermore, RSNOD lies between the popular bin-by-bin measures and NMD in terms of how it behaves. We recommend using both of these measures in the aforementioned type of evaluation tasks.

Robust Asymmetric Recommendation via Min-Max Optimization

Recommender systems with implicit feedback (e.g. clicks and purchases) suffer from two critical limitations: 1) imbalanced labels may mislead the learning process of the conventional models that assign balanced weights to the classes; and 2) outliers with large reconstruction errors may dominate the objective function by the conventional $L_2$-norm loss. To address these issues, we propose a robust asymmetric recommendation model. It integrates cost-sensitive learning with capped unilateral loss into a joint objective function, which can be optimized by an iteratively weighted approach. To reduce the computational cost of low-rank approximation, we exploit the dual characterization of the nuclear norm to derive a min-max optimization problem and design a subgradient algorithm without performing full SVD. Finally, promising empirical results demonstrate the effectiveness of our algorithm on benchmark recommendation datasets.

From the PRP to the Low Prior Discovery Recall Principle for Recommender Systems

We revisit the Probability Ranking Principle in the context of recommender systems. We find a key difference in the retrieval protocol with respect to query-based search, that leads to the identification of different optimal ranking principles for discovery-oriented recommendation. Based on this finding, we revise the effectiveness of common non-personalized ranking functions in respect to the new principles. We run an experiment confirming and illustrating our theoretical analysis, and providing further observations and hints for reflection and future research.

Deep Learning for Epidemiological Predictions

Predicting new and urgent trends in epidemiological data is an important problem for public health, and has attracted increasing attention in the data mining and machine learning communities. The temporal nature of epidemiology data and the need for real-time prediction by the system makes the problem residing in the category of time-series forecasting or prediction. While traditional autoregressive (AR) methods and Gaussian Process Regression (GPR) have been actively studied for solving this problem, deep learning techniques have not been explored in this domain. In this paper, we develop a deep learning framework, for the first time, to predict epidemiology profiles in the time-series perspective. We adopt Recurrent Neural Networks (RNNs) to capture the long-term correlation in the data and Convolutional Neural Networks (CNNs) to fuse information from data of different sources. A residual structure is also applied to prevent overfitting issues in the training process. We compared our model with the most widely used AR models on USA and Japan datasets. Our approach provides consistently better results than these baseline methods.

Query Variation Performance Prediction for Systematic Reviews

When conducting systematic reviews, medical researchers heavily deliberate over the final query to pose to the information retrieval system. Given the possible query variations that they could construct, selecting the best performing query is difficult. This motivates a new type of query performance prediction (QPP) task where the challenge is to estimate the performance of a set of query variations given a particular topic. Query variations are the reductions, expansions and modifications of a given seed query under the hypothesis that there exists some variations (either generated from permutations or hand crafted) which will improve retrieval effectiveness over the original query. We use the CLEF 2017 TAR Collection, to evaluate sixteen pre and post retrieval predictors for the task of Query Variation Performance Prediction (QVPP). Our findings show the IDF based QPPs exhibits the strongest correlations with performance. However, when using QPPs to select the best query, little improvement over the original query can be obtained, despite the fact that there are query variations which perform significantly better. Our findings highlight the difficulty in identifying effective queries within the context of this new task, and motivates further research to develop more accurate methods to help systematic review researchers in the query selection process.

Attention-based Hierarchical Neural Query Suggestion

Query suggestions help users of a search engine to refine their queries. Previous work on query suggestion has mainly focused on incorporating directly observable features such as query co-occurrence and semantic similarity. The structure of such features is often set manually, as a result of which hidden dependencies between queries and users may be ignored. We propose an AHNQS model that combines a hierarchical structure with a session-level neural network and a user-level neural network to model the short- and long-term search history of a user. An attention mechanism is used to capture user preferences. We quantify the improvements of AHNQS over state-of-the-art RNN-based query suggestion baselines on the AOL query log dataset, with improvements of up to 21.86% and 22.99% in terms of [email protected] and [email protected], respectively, over the state-of-the-art; improvements are especially large for short sessions.

Texygen: A Benchmarking Platform for Text Generation Models

We introduce Texygen, a benchmarking platform to support research on open-domain text generation models. Texygen has not only implemented a majority of text generation models, but also covered a set of metrics that evaluate the diversity, the quality and the consistency of the generated texts. The Texygen platform could help standardize the research on text generation and improve the reproductivity and reliability of future research work in text generation.

CA-LSTM: Search Task Identification with Context Attention based LSTM

Search task identification aims to understand a user's information needs to improve search quality for applications such as query suggestion, personalized search, and advertisement retrieval. To properly identify the search task within long query sessions, it is important to partition these sessions into segments before further processing. In this paper, we present the first search session segmentation model that uses a long short-term memory (LSTM) network with an attention mechanism. This model considers sequential context knowledge, including temporal information, word and character, and essentially learns which parts of the input query sequence are relevant to the current segmentation task and determines the influence of these parts. This segmentation technique is also combined with an efficient clustering method using a novel query relevance metric for end-to-end search task identification. Using real-world datasets, we demonstrate that our segmentation technique improves task identification accuracy of existing clustering-based algorithms.

On the Volatility of Commercial Search Engines and its Impact on Information Retrieval Research

We studied the volatility of commercial search engines and reflected on its impact on research that uses them as basis of algorithmical techniques or for user studies. Search engine volatility refers to the fact that a query posed to a search engine at two different points in time returns different documents. By comparing search results retrieved every 2 days over a period of 64 days, we found that the considered commercial search engine API consistently presented volatile search results: it both retrieved new documents, and it ranked documents previously retrieved at different ranks throughout time. Moreover, not only results are volatile: we also found that the effectiveness of the search engine in answering a query is volatile. Our findings reaffirmed that results from commercial search engines are volatile and that care should be taken when using these as basis for researching new information retrieval techniques or performing user studies.

Deep Semantic Text Hashing with Weak Supervision

With an ever increasing amount of data available on the web, fast similarity search has become the critical component for large-scale information retrieval systems. One solution is semantic hashing which designs binary codes to accelerate similarity search. Recently, deep learning has been successfully applied to the semantic hashing problem and produces high-quality compact binary codes compared to traditional methods. However, most state-of-the-art semantic hashing approaches require large amounts of hand-labeled training data which are often expensive and time consuming to collect. The cost of getting labeled data is the key bottleneck in deploying these hashing methods. Motivated by the recent success in machine learning that makes use of weak supervision, we employ unsupervised ranking methods such as BM25 to extract weak signals from training data. We further introduce two deep generative semantic hashing models to leverage weak signals for text hashing. The experimental results on four public datasets show that our models can generate high-quality binary codes without using hand-labeled training data and significantly outperform the competitive unsupervised semantic hashing baselines.

Who is the Mr. Right for Your Brand?: -- Discovering Brand Key Assets via Multi-modal Asset-aware Projection

Following the rising prominence of online social networks, we observe an emerging trend for brands to adopt influencer marketing, embracing key opinion leaders (KOLs) to reach potential customers (PCs) online. Owing to the growing strategic importance of these brand key assets, this paper presents a novel feature extraction method named Multi-modal Asset-aware Projection (M2A2P) to learn a discriminative subspace from the high-dimensional multi-modal social media data for effective brand key asset discovery. By formulating a new asset-aware discriminative information preserving criterion, M2A2P differentiates with the existing multi-model feature extraction algorithms in two pivotal aspects: 1) We consider brand's highly imbalanced class interest steering towards the KOLs and PCs over the irrelevant users; 2) We consider a common observation that a user is not exclusive to a single class (e.g. a KOL can also be a PC). Experiments on a real-world apparel brand key asset dataset validate the effectiveness of the proposed method.

Sogou-QCL: A New Dataset with Click Relevance Label

Data is of vital importance in the development of machine learning technologies. Recently, within the information retrieval field, a number of neural ranking frameworks have been proposed to address the ad-hoc search. These models usually need a large amount of query-document relevance judgments for training. However, obtaining this kind of relevance judgments needs a lot of money and manual effort. To shed light on this problem, researchers seek to use implicit feedback from users of search engines to improve the ranking performance. In this paper, we present a new dataset, Sogou-QCL, which contains 537,366 queries and five kinds of weak relevance labels for over 12 million query-document pairs. We apply Sogou-QCL dataset to train recent neural ranking models and show its potential to serve as weak supervision for ranking. We believe that Sogou-QCL will have a broad impact on corresponding areas.

Towards Designing Better Session Search Evaluation Metrics

User satisfaction has been paid much attention to in recent Web search evaluation studies and regarded as the ground truth for designing better evaluation metrics. However, most existing studies are focused on the relationship between satisfaction and evaluation metrics at query-level. However, while search request becomes more and more complex, there are many scenarios in which multiple queries and multi-round search interactions are needed (e.g. exploratory search). In those cases, the relationship between session-level search satisfaction and session search evaluation metrics remain uninvestigated. In this paper, we analyze how users' perceptions of satisfaction accord with a series of session-level evaluation metrics. We conduct a laboratory study in which users are required to finish some complex search tasks and provide usefulness judgments of documents as well as session-level and query level satisfaction feedbacks. We test a number of popular session search evaluation metrics as well as different weighting functions. Experiment results show that query-level satisfaction is mainly decided by the clicked document that they think the most useful (maximum effect). While session-level satisfaction is highly correlated with the most recently issued queries (recency effect). We further propose a number of criteria for designing better session search evaluation metrics.

Predicting Contradiction Intensity: Low, Strong or Very Strong?

Reviews on web resources (e.g. courses, movies) become increasingly exploited in text analysis tasks (e.g. opinion detection, controversy detection). This paper investigates contradiction intensity in reviews exploiting different features such as variation of ratings and variation of polarities around specific entities (e.g. aspects, topics). Firstly, aspects are identified according to the distributions of the emotional terms in the vicinity of the most frequent nouns in the reviews collection. Secondly, the polarity of each review segment containing an aspect is estimated. Only resources containing these aspects with opposite polarities are considered. Finally, some features are evaluated, using feature selection algorithms, to determine their impact on the effectiveness of contradiction intensity detection. The selected features are used to learn some state-of-the-art learning approaches. The experiments are conducted on the Massive Open Online Courses data set containing 2244 courses and their 73,873 reviews, collected from coursera.org. Results showed that variation of ratings, variation of polarities, and reviews quantity are the best predictors of contradiction intensity. Also, J48 was the most effective learning approach for this type of classification.

A Large-Scale Study of Mobile Search Examination Behavior

With the rapid growth of mobile web search, it is necessary and important to understand user's examination behavior on mobile devices in the absence of clicks. Previous studies used viewport metrics to estimate user's attention. However, there still lacks an in-depth understanding of how search users examine and interact with the mobile SERP. In this work, based on the large-scale real search log collected from a popular commercial mobile search engine, we present a comprehensive analysis of examination behavior. Specifically, we analyze the position bias, the relationship with click behavior, and examination's change as the session continues. The findings shed new light on the understanding of user's examination behavior, and also provide some implication for the improvement and evaluation of mobile search engine.

Ranking Without Learning: Towards Historical Relevance-based Ranking of Social Images

Tag-based Social Image Retrieval (TagIR) aims to find relevant social images using keyword queries. State-of-the-art TagIR techniques typically rank query results based on relevance, temporal or popularity criteria. However, these criteria may not always be sufficient to match diverse search intents of users. In this paper, we present a novel ranking scheme that ranks query results (images) based on their historical relevance. Informally, an image is historically relevant if its visual content is relevant to the query and it depicts objects, scenes, or events that are related to human history. To this end, we propose a learning-agnostic technique that leverages Wikipedia to quantify historical relevance of images. We empirically demonstrate the effectiveness of our ranking scheme using Flickr dataset.

Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate

Estimating post-click conversion rate (CVR) accurately is crucial for ranking systems in industrial applications such as recommendation and advertising. Conventional CVR modeling applies popular deep learning methods and achieves state-of-the-art performance. However it encounters several task-specific problems in practice, making CVR modeling challenging. For example, conventional CVR models are trained with samples of clicked impressions while utilized to make inference on the entire space with samples of all impressions. This causes a sample selection bias problem. Besides, there exists an extreme data sparsity problem, making the model fitting rather difficult. In this paper, we model CVR in a brand-new perspective by making good use of sequential pattern of user actions, i.e., impression -> click -> conversion. The proposed Entire Space Multi-task Model (ESMM) can eliminate the two problems simultaneously by i) modeling CVR directly over the entire space, ii) employing a feature representation transfer learning strategy. Experiments on dataset gathered from Taobao's recommender system demonstrate that ESMM significantly outperforms competitive methods. We also release a sampling version of this dataset to enable future research. To the best of our knowledge, this is the first public dataset which contains samples with sequential dependence of click and conversion labels for CVR modeling.

How Much is Too Much?: Whole Session vs. First Query Behaviors in Task Type Prediction

One of the emerging and important problems in Interactive Information Retrieval research is predicting search tasks. Given a searcher's behavior during a search session, can the searcher's task be predicted? Which aspects of the task can be predicted, and how quickly and how accurately can the prediction be made? Much past literature has examined relationships between browsing behavior and task type at a statistical level, and recent work is moving towards prediction. While one may think whole session measures are useful for prediction, recent findings on common measures have suggested the contrary. Can less of the session still be useful? We examine the opposite end: the first query. Using multiple data sets for comparison, our results suggest that first query measures can be at least as good as -- and sometimes better than -- whole session measures for certain task type predictions.

Effectiveness Evaluation with a Subset of Topics: A Practical Approach

Several researchers have proposed to reduce the number of topics used in TREC-like initiatives. One research direction that has been pursued is what is the optimal topic subset of a given cardinality that evaluates the systems/runs in the most accurate way. Such a research direction has been so far mainly theoretical, with almost no indication on how to select the few good topics in practice. We propose such a practical criterion for topic selection: we rely on the methods for automatic system evaluation without relevance judgments, and by running some experiments on several TREC collections we show that the topics selected on the basis of those evaluations are indeed more informative than random topics.

Locality-adapted Kernel Densities for Tweet Localization

We propose a location prediction method for tweets based on the geographical probability distribution of their terms over a region. In our method, the probabilities are calculated using Kernel Density Estimation (KDE), where the bandwidth of the kernel function for each term is determined separately according to the location indicativeness of the term. Prediction for a new tweet is performed by combining the probability distributions of its terms weighted by their information gain ratio. The method we propose relies on statistical approaches without requiring any parameter tuning. Experiments conducted on three tweet sets from different regions of the world indicate significant improvement in prediction accuracy compared to the state-of-the-art methods.

Related or Duplicate: Distinguishing Similar CQA Questions via Convolutional Neural Networks

Plenty of research attempts target the automatic duplicate detection in Community Question Answering (CQA) systems and frame the task as a supervised learning problem on the question pairs. However, these methods rely on handcrafted features, leading to the difficulty of distinguishing related and duplicate questions as they are often textually similar. To tackle this issue, we propose to leverage neural network architecture to extract "deep" features to identify whether a question pair is duplicate or related. In particular, we construct question correlation matrices, which capture the word-wise similarities between questions. The constructed matrices are input to our proposed convolutional neural network (CNN), in which the convolutional operation moves through the two dimensions of the matrices. Empirical studies on a range of real-world CQA datasets confirm the effectiveness of our proposed correlation matrices and the CNN. Our method outperforms the state-of-the-art methods and achieves better classification performance.

Procrastination is the Thief of Time: Evaluating the Effectiveness of Proactive Search Systems

Users of current search systems actively interact with the system to complete their search task. This can encompass formulating and reformulating a series queries expressing evolving of different information needs. We believe that the next generation of search systems will see a shift towards proactive understanding of user intent based on analysis of user activities. Such a proactive search system could start recommending documents that are likely to help users accomplish their tasks without requiring them to explicitly submit queries to the system. We propose a framework to evaluate such a search system. The key idea behind our proposed metric is to aggregate a correlation measure over a search session between the expected outcome, which in this case refers to the list of documents retrieved with a true user query, and the predicted outcome, which refers to the list of documents recommended by a proactive search system. Experiments on the AOL query log data show that the ranking of two sample proactive IR systems induced by our metric conforms to the expected ranking between these systems.

Convolution-based Memory Network for Aspect-based Sentiment Analysis

Memory networks have shown expressive performance on aspect based sentiment analysis. However, ordinary memory networks only capture word-level information and lack the capacity for modeling complicated expressions which consist of multiple words. Targeting this problem, we propose a novel convolutional memory network which incorporates an attention mechanism. This model sequentially computes the weights of multiple memory units corresponding to multi-words. This model may capture both words and multi-words expressions in sentences for aspect-based sentiment analysis. Experimental results show that the proposed model outperforms the state-of-the-art baselines.

WikiPassageQA: A Benchmark Collection for Research on Non-factoid Answer Passage Retrieval

With the rise in mobile and voice search, answer passage retrieval acts as a critical component of an effective information retrieval system for open domain question answering. Currently, there are no comparable collections that address non-factoid question answering within larger documents while simultaneously providing enough examples sufficient to train a deep neural network. In this paper, we introduce a new Wikipedia based collection specific for non-factoid answer passage retrieval containing thousands of questions with annotated answers and show benchmark results on a variety of state of the art neural architectures and retrieval models. The experimental results demonstrate the unique challenges presented by answer passage retrieval within topically relevant documents for future research.

Beyond Pooling

Dynamic Sampling is a novel, non-uniform, statistical sampling strategy in which documents are selected for relevance assessment based on the results of prior assessments. Unlike static and dynamic pooling methods that are commonly used to compile relevance assessments for the creation of information retrieval test collections, Dynamic Sampling yields a statistical sample from which substantially unbiased estimates of effectiveness measures may be derived. In contrast to static sampling strategies, which make no use of relevance assessments, Dynamic Sampling is able to select documents from a much larger universe, yielding superior test collections for a given budget of relevance assessments. These assertions are supported by simulation studies using secondary data from the TREC 2017 Common Core Track.

Testing the Cluster Hypothesis with Focused and Graded Relevance Judgments

The cluster hypothesis is a fundamental concept in ad hoc retrieval. Heretofore, cluster hypothesis tests were applied to documents using binary relevance judgments. We present novel tests that utilize graded and focused relevance judgments; the latter are markups of relevant text in relevant documents. Empirical exploration reveals that the cluster hypothesis holds not only for documents, but also for passages, as measured by the proposed tests. Furthermore, the hypothesis holds to a higher extent for highly relevant documents and for those that contain a high fraction of relevant text.

Query Performance Prediction Focused on Summarized Letor Features

Query performance prediction (QPP) aims at automatically estimating the information retrieval system effectiveness for any user's query. Previous work has investigated several types of pre- and post-retrieval query performance predictors; the latter has been shown to be more effective. In this paper we investigate the use of features that were initially defined for learning to rank in the task of QPP. While these features have been shown to be useful for learning to rank documents, they have never been studied as query performance predictors. We developed more than 350 variants of them based on summary functions. Conducting experiments on four TREC standard collections, we found that Letor-based features appear to be better QPP than predictors from the literature. Moreover, we show that combining the best Letor features outperforms the state of the art query performance predictors. This is the first study that considers such an amount and variety of Letor features for QPP and that demonstrates they are appropriate for this task.

A Study of Per-Topic Variance on System Comparison

Under the notion that the document collection is a sample from a population, the observed per-topic metric (e.g., AP) value varies with different samples, leading to the per-topic variance. The results of the system comparison, such as comparing the ranking of systems according to the summary metric (e.g., MAP) or testing whether there is significant difference between two systems, are affected by the variability of per-topic metric values. In this paper, we study the effect of per-topic variance on the system comparison. To measure such effects, we employ two ranking-based methods, i.e., Error Rate (ER) and Kendall Rank Correlation Coefficient (KRCC), as well as two significance test based methods, namely Achieved Significance Level (ASL) and Estimated Difference (ED). We conduct empirical comparison of TREC participated systems on Robust and Adhoc track, which shows that the effect of per-topic variance on the ranking of systems is not obvious, while the significance test based comparisons are susceptible to the per-topic variance.

Online Job Search: Study of Users' Search Behavior using Search Engine Query Logs

Over the last few years, an increasing number of user's and enterprises on the internet has generated a global marketplace for both employers and job seekers. Despite the fact that online job search is now more preferable than traditional methods - leading to better matches between the job seekers and the employer's intents - there is still little insight into how online job searches are different from general web searches. In this paper, we explore the different characteristics of online job search and their differences with general searches, by leveraging search engine query logs. Our experimental results show that job searches have specific attributes which can be used by search engines to increase the quality of the search results.

Identifying and Modeling Information Resumption Behaviors in Cross-Device Search

Enlightened by task resumption behaviors in cross-session search, we explore information resumption behaviors in cross-device search. In order to find important features of information resumption behaviors, we conducted a user experiment and modeled information resumption behaviors using machine learning. The model of C5.0 Decision Tree outperformed and showed that features of FamiliarityScores, AveEditDistance, AveQueryEffectiveRate and ValidClickRate are of importance.

Optimizing Query Evaluations Using Reinforcement Learning for Web Search

In web search, typically a candidate generation step selects a small set of documents---from collections containing as many as billions of web pages---that are subsequently ranked and pruned before being presented to the user. In Bing, the candidate generation involves scanning the index using statically designed match plans that prescribe sequences of different match criteria and stopping conditions. In this work, we pose match planning as a reinforcement learning task and observe up to 20% reduction in index blocks accessed, with small or no degradation in the quality of the candidate sets.

SAAN: A Sentiment-Aware Attention Network for Sentiment Analysis

Analyzing public opinions towards products, services and social events is an important but challenging task. Despite the remarkable successes of deep neural networks in sentiment analysis, these approaches do not make full use of the prior sentiment knowledge (e.g., sentiment lexicon, negation words, intensity words). In this paper, we propose a Sentiment-Aware Attention Network (SAAN) to boost the performance of sentiment analysis, which adopts a three-step strategy to learn the sentiment-specific sentence representation. First, we employ a word-level mutual attention mechanism to model word-level correlation. Next, a phrase-level convolutional attention is designed to obtain phrase-level correlation. Finally, a sentence-level multi-head attention mechanism is proposed to capture various sentimental information from different subspaces. The experiments on Movie Review (MR) and Stanford Sentiment Treebank (SST) show that our model consistently outperform the previous methods for sentiment analysis.

An Attribute-aware Neural Attentive Model for Next Basket Recommendation

Next basket recommendation is a new type of recommendation, which recommends a set of items, or a basket, to the user. Purchase in basket is a common behavior of consumers. Recently, deep neural networks have been applied to model sequential transactions of baskets in next basket recommendation. However, current methods do not track the user's evolving appetite for items explicitly, and they ignore important item attributes such as product category. In this paper, we propose a novel Attribute-aware Neural Attentive Model (ANAM) to address these problems. ANAM adopts an attention mechanism to explicitly model user's evolving appetite for items, and utilizes a hierarchical architecture to incorporate the attribute information. In specific, ANAM utilizes a recurrent neural network to model the user's sequential behavior over time, and relays the user's appetite toward items and their attributes to next basket through attention weights shared across baskets on the two different hierarchies. Experiment results on two public datasets (ıe Ta-Feng and JingDong) demonstrate the effectiveness of our ANAM model for next basket recommendation.

Characterizing Question Facets for Complex Answer Retrieval

Complex answer retrieval (CAR) is the process of retrieving answers to questions that have multifaceted or nuanced answers. In this work, we present two novel approaches for CAR based on the observation that question facets can vary in utility: from structural (facets that can apply to many similar topics, such as 'History') to topical (facets that are specific to the question's topic, such as the 'Westward expansion' of the United States). We first explore a way to incorporate facet utility into ranking models during query term score combination. We then explore a general approach to reform the structure of ranking models to aid in learning of facet utility in the query-document term matching phase. When we use our techniques with a leading neural ranker on the TREC CAR dataset, our methods yield statistically significant improvements over both an unmodified neural architecture and submitted TREC runs.

A Test Collection for Coreferent Mention Retrieval

This paper introduces the coreferent mention retrieval task, in which the goal is to retrieve sentences that mention a specific entity based on a query by example in which one sentence mentioning that entity is provided. The development of a coreferent mention retrieval test collection is then described. Results are presented for five coreferent mention retrieval systems, both to illustrate the use of the collection and to specify the results that were pooled on which human coreference judgments were performed. The new test collection is built from content that is available from the Linguistic Data Consortium; the partitioning and human annotations used to create the test collection atop that content are being made freely available.

What Do Viewers Say to Their TVs?: An Analysis of Voice Queries to Entertainment Systems

A recently-introduced product of Comcast, a large cable company in the United States, is a "voice remote" that accepts spoken queries from viewers. We present an analysis of a large query log from this service to answer the question: "What do viewers say to their TVs?" In addition to a descriptive characterization of queries and sessions, we describe two complementary types of analyses to support query understanding. First, we propose a domain-specific intent taxonomy to characterize viewer behavior: as expected, most intents revolve around watching programs---both direct navigation as well as browsing---but there is a non-trivial fraction of non-viewing intents as well. Second, we propose a domain-specific tagging scheme for labeling query tokens, that when combined with intent and program prediction, provides a multi-faceted approach to understand voice queries directed at entertainment systems.

Sanity Check: A Strong Alignment and Information Retrieval Baseline for Question Answering

While increasingly complex approaches to question answering (QA) have been proposed, the true gain of these systems, particularly with respect to their expensive training requirements, can be in- flated when they are not compared to adequate baselines. Here we propose an unsupervised, simple, and fast alignment and informa- tion retrieval baseline that incorporates two novel contributions: a one-to-many alignment between query and document terms and negative alignment as a proxy for discriminative information. Our approach not only outperforms all conventional baselines as well as many supervised recurrent neural networks, but also approaches the state of the art for supervised systems on three QA datasets. With only three hyperparameters, we achieve 47% [email protected] on an 8th grade Science QA dataset, 32.9% [email protected] on a Yahoo! answers QA dataset and 64% MAP on WikiQA.

Explaining Controversy on Social Media via Stance Summarization

In an era in which new controversies rapidly emerge and evolve on social media, navigating social media platforms to learn about a new controversy can be an overwhelming task. In this light, there has been significant work that studies how to identify and measure controversy online. However, we currently lack a tool for effectively understanding controversy in social media. For example, users have to manually examine postings to find the arguments of conflicting stances that make up the controversy. In this paper, we study methods to generate a stance-aware summary that explains a given controversy by collecting arguments of two conflicting stances. We focus on Twitter and treat the stance summarization as a ranking problem of finding the top k tweets that best summarize the two conflicting stances of a controversial topic. We formalize the characteristics of a good stance summary and propose a ranking model accordingly. We first evaluate our methods on five controversial topics on Twitter. Our user study shows that our methods consistently outperform other baseline techniques in generating a summary that explains the given controversy.

Identifying Clickbait: A Multi-Strategy Approach Using Neural Networks

Online media outlets, in a bid to expand their reach and subsequently increase revenue through ad monetisation, have begun adopting clickbait techniques to lure readers to click on articles. The article fails to fulfill the promise made by the headline. Traditional methods for clickbait detection have relied heavily on feature engineering which, in turn, is dependent on the dataset it is built for. The application of neural networks for this task has only been explored partially. We propose a novel approach considering all information found in a social media post. We train a bidirectional LSTM with an attention mechanism to learn the extent to which a word contributes to the post's clickbait score in a differential manner. We also employ a Siamese net to capture the similarity between source and target information. Information gleaned from images has not been considered in previous approaches. We learn image embeddings from large amounts of data using Convolutional Neural Networks to add another layer of complexity to our model. Finally, we concatenate the outputs from the three separate components, serving it as input to a fully connected layer. We conduct experiments over a test corpus of 19538 social media posts, attaining an F1 score of 65.37% on the dataset bettering the previous state-of-the-art, as well as other proposed approaches, feature engineering or otherwise.

Multi-Target Stance Detection via a Dynamic Memory-Augmented Network

Stance detection aims at inferring from text whether the author is in favor of, against, or neutral towards a target entity. Most of the existing studies consider different target entities separately. However, in many scenarios, stance targets are closely related, such as several candidates in a general election and different brands of the same product. Multi-target stance detection, in contrast, aims at jointly detecting stances towards multiple related targets. As stance expression regarding a target can provide additional information to help identify the stances towards other related targets, modeling expressions regarding multiple targets jointly is beneficial for improving the overall performance compared to single-target scheme. In this paper, we propose a dynamic memory-augmented network DMAN for multi-target stance detection. DMAN utilizes a shared external memory, which is dynamically updated through the learning process, to capture and store stance-indicative information for multiple related targets. It then jointly predicts stances towards these targets in a multitask manner. Experimental results show the effectiveness of our DMAN model.

Query Performance Prediction and Effectiveness Evaluation Without Relevance Judgments: Two Sides of the Same Coin

Some methods have been developed for automatic effectiveness evaluation without relevance judgments. We propose to use those methods, and their combination based on a machine learning approach, for query performance prediction. Moreover, since predicting average precision as it is usually done in query performance prediction literature is sensitive to the reference system that is chosen, we focus on predicting the average of average precision values over several systems. Results of an extensive experimental evaluation on ten TREC collections show that our proposed methods outperform state-of-the-art query performance predictors.

A New Term Frequency Normalization Model for Probabilistic Information Retrieval

In probabilistic BM25, term frequency normalization is one of the key components. It is often controlled by parameters $k_1$ and b , which need to be optimized for each given data set. In this paper, we assume and show empirically that term frequency normalization should be specific with query length in order to optimize retrieval performance. Following this intuition, we first propose a new term frequency normalization with query length for probabilistic information retrieval, namely \textttBM25\tiny QL . Then \textttBM25\tiny QL is incorporated into the state-of-the-art models CRTER riptsize 2 and LDA-BM25, denoted as $\textttCRTER riptsize 2 ^\texttt\tiny QL $ and \textttLDA-BM25\tiny QL respectively. A series of experiments show that our proposed approaches \textttBM25\tiny QL , $\textttCRTER riptsize 2 ^\texttt\tiny QL $ and \textttLDA-BM25\tiny QL are comparable to BM25, CRTER riptsize 2 and LDA-BM25 with the optimal b setting in terms of MAP on all the data sets.

Transparent Tree Ensembles

Every day more technologies and services are backed by complex machine-learned models, consuming large amounts of data to provide a myriad of useful services. While users are willing to provide personal data to enable these services, their trust in and engagement with the systems could be improved by providing insight into how the machine learned decisions were made. Complex ML systems are highly effective but many of them are black boxes and give no insight into how they make the choices they make. Moreover, those that do often do so at the model-level rather than the instance-level. In this work we present a method for deriving explanations for instance-level decisions in tree ensembles. As this family of models accounts for a large portion of industrial machine learning, this work opens up the possibility for transparent models at scale.

A Taxonomy of Queries for E-commerce Search

Understanding the search tasks and search behavior of users is necessary for optimizing search engine results. While much work has been done on understanding the users in Web search, little knowledge is available about the search tasks and behavior of users in the E-Commerce (E-Com) search applications. In this paper, we share the first empirical study of the queries and search behavior of users in E-Com search by analyzing search log from a major E-Com search engine. The analysis results show that E-Com queries can be categorized into five categories, each with distinctive search behaviors: (1) Shallow Exploration Queries are short vague queries that a user may use initially in exploring the product space. (2) Targeted Purchase Queries are queries used by users to purchase items that they are generally familiar with, thus without much decision making. (3) Major-Item Shopping Queries are used by users to shop for a major item which is often relatively expensive and thus requires some serious exploration, but typically in a limited scope of choices. (4) Minor-Item Shopping Queries are used by users to shop for minor items that are generally not very expensive, but still require some exploration of choices. (5) Hard-Choice Shopping Queries are used by users who want to deeply explore all the candidate products before finalizing the choice often appropriate when multiple products must be carefully compared with each other. These five categories form a taxonomy for E-Com queries and can shed light on how we may develop customized search technologies for each type of search queries to improve search engine utility.

Theoretical Analysis of Interdependent Constraints in Pseudo-Relevance Feedback

Axiomatic analysis is a well-defined theoretical framework for analytical evaluation of information retrieval models. The current studies in axiomatic analysis implicitly assume that the constraints (axioms) are independent. In this paper, we revisit this assumption and hypothesize that there might be interdependence relationships between the existing constraints. As a preliminary study, we focus on the pseudo-relevance feedback (PRF) models that have been theoretically studied using the axiomatic analysis approach. In this paper, we introduce two novel interdependent PRF constraints which emphasize on the effect of existing constraints on each other. We further modify two state-of-the-art PRF models, log-logistic and relevance models, in order to satisfy the proposed constraints. Experiments on three TREC newswire and web collections demonstrate that the proposed modifications significantly outperform the baselines, in all cases.

Unsupervised Cross-Lingual Information Retrieval Using Monolingual Data Only

We propose a fully unsupervised framework for ad-hoc cross-lingual information retrieval (CLIR) which requires no bilingual data at all. The framework leverages shared cross-lingual word embedding spaces in which terms, queries, and documents can be represented, irrespective of their actual language. The shared embedding spaces are induced solely on the basis of monolingual corpora in two languages through an iterative process based on adversarial neural networks. Our experiments on the standard CLEF CLIR collections for three language pairs of varying degrees of language similarity (English-Dutch/Italian/Finnish) demonstrate the usefulness of the proposed fully unsupervised approach. Our CLIR models with unsupervised cross-lingual embeddings outperform baselines that utilize cross-lingual embeddings induced relying on word-level and document-level alignments. We then demonstrate that further improvements can be achieved by unsupervised ensemble CLIR models. We believe that the proposed framework is the first step towards development of effective CLIR models for language pairs and domains where parallel data are scarce or non-existent.

Toward Voice Query Clarification

Query suggestions are a standard means to clarify the intent of underspecified queries. In a voice-based search setting, the compilation of query suggestions is not straightforward, and user-centric research targeting query underspecification is lacking so far. Our paper analyses a specific type of ambiguous voice queries and studies the impact of various kinds of voice query clarifications offered by the system and its impact on user satisfaction. We conduct a user study that measures the satisfaction for clarifications that are explicitly invoked and presented by seven different methods. Our findings include that (1) user experience depends on language proficiency levels, (2) users are not dissatisfied when prompted for clarifications (in fact, enjoy it sometimes), and (3) the most effective way of query clarification depends on the number and lengths of the possible answers.

A Test Collection for Evaluating Legal Case Law Search

Test collection based evaluation represents the standard of evalua- tion for information retrieval systems. Legal IR, more speci cally case law retrieval, has no such standard test collection for evalua- tion. In this paper, we present a test collection for use in evaluating case law search, being the retrieval of judicial decisions relevant to a particular legal question. The collection is made available at ielab.io/caselaw.

SESSION: Demonstration Papers I

SearchX: Empowering Collaborative Search Research

Collaborative search has been an active area of research within the IR community for many years. While for "single-user'' research a variety of up-to-date open-source search systems exist, few "multi-user'' search tools are open-source and even fewer are being maintained. In this paper, we present SearchX, an open-source collaborative search system we are currently developing-and using for our research. We designed and built SearchX using the modern Web stack (and are thus not siloed by an operating system or a particular browser type), enabling efficient research across platforms (Desktop, mobile) and with online users (e.g. crowdworkers). A video, describing the demo can be found at https: //www.youtube.com/watch?v=uf24m6p3vts.

A/B Testing with APONE

In order to improve long-term retention, ad conversion rates, and so on, A/B testing has become the norm within Web portals, enabling efficient large-scale experimentation. While A/B testing is also increasingly used by academic researchers (with crowd-working platforms offering a large pool of artificial users), few platforms are freely available to this end. Academic researchers usually develop adhoc solutions, leading to many duplicated efforts and time spent on work not directly related to one's research. As an alternative, we have developed and open sourced APONE, an A cademic P latform for ON line Experiments. APONE uses PlanOut, a framework and high-level language, to specify online experiments, and offers Web services and a Web GUI to easily create, manage and monitor them. By building a user friendly Web application, we enable not only experts to conduct valid A/B experiments. In particular as a secondary use case, we envision large classrooms to also benefit from the deployment of APONE, a vision we put into practice in a graduate Information Retrieval course. We open-source APONE at https://marrerom.github.io/APONE. A demo version is running at http://ireplatform.ewi.tudelft.nl:8080/APONE.

CoNEREL: Collective Information Extraction in News Articles

We present CoNEREL, a system for collective named entity recognition and entity linking focusing on news articles and readers' comments. Different from other systems, CoNEREL processes articles and comments in batch mode, to make the best use of the shared contexts of multiple news stories and their comments. Particularly, a news article provides context for all its comments. To improve named entity recognition, CoNEREL utilizes co-reference of mentions to refine their class labels ( e.g. , person, location). To link the recognized entities to Wikipedia, our system implements Pair-Linking, a state-of-the-art entity linking algorithm. Furthermore, CoNEREL provides an interactive visualization of the Pair-Linking process. From the visualization, one can understand how Pair-Linking achieves decent linking performance through iterative evidence building, while being extremely fast and efficient. The graph formed by the Pair-Linking process naturally becomes a good summary of entity relations, making CoNEREL a useful tool to study the relationships between the entities mentioned in an article, as well as the ones that are discussed in its comments.

A2A: Benchmark Your Clinical Decision Support Search

Clinical Decision Support (CDS) systems aim to assist clinicians in their daily decision-making related to diagnosis, tests, and treatments of patients by providing relevant evidence from the scientific literature. This promise however is yet to be fulfilled, with search for relevant literature for a given patient condition still being an active research topic. The TREC CDS track was designed to address this research gap. We developed a platform to facilitate experimentation and hypothesis testing for information retrieval researchers working on this topic. It provides a large range of query and document processing techniques that are explored in the biomedical search domain.

An Information Retrieval Experiment Framework for Domain Specific Applications

We present a framework for constructing and executing information retrieval experiment pipelines. The framework as a whole is built primarily for domain specific applications such as medical literature search for systematic reviews, or finding factually or legally applicable case law in the legal domain; however it can also be used for more general tasks. There are a number of pre-implemented components that enable common information retrieval experiments such as ad-hoc retrieval or query analysis through query performance predictors. In addition, this collection of tools seeks to be user friendly, well documented, and easily extendible. Finally, the entire pipeline can be distributed as a single binary with no dependencies, ready to use with a simple domain specific language (DSL) for constructing pipelines.

Vote Goat: Conversational Movie Recommendation

Conversational search and recommendation systems that use natural language interfaces are an increasingly important area raising a number of research and interface design questions. Despite the increasing popularity of digital personal assistants, the number of conversational recommendation systems is limited and their functionality basic. In this demonstration we introduce Vote Goat, a conversational recommendation agent built using Google's DialogFlow framework. The demonstration provides an interactive movie recommendation system using a speech-based natural language interface. The main intents span search and recommendation tasks including: rating movies, receiving recommendations, retrieval over movie metadata, and viewing crowdsourced statistics. Vote Goat uses gamification to incentivize movie voting interactions with the 'Greatest Of All Time' (GOAT) movies derived from user ratings. The demo includes important functionality for research applications with logging of interactions for building test collections as well as A/B testing to allow researchers to experiment with system parameters.

LogCanvas: Visualizing Search History Using Knowledge Graphs

In this demo paper, we introduce LogCanvas, a platform for user search history visualization.Different from the existing visualization tools, LogCanvas focuses on helping users re-construct the semantic relationship among their search activities. LogCanvas segments a user's search history into different sessions and generates a knowledge graph to represent the information exploration process in each session.A knowledge graph is composed of the most important concepts or entities discovered by each search query as well as their relationships. It thus captures the semantic relationship among the queries.LogCanvas offers a session timeline viewer and a snippets viewer to enable users to re-find their previous search results efficiently. LogCanvas also provides a collaborative perspective to support a group of users in sharing search results and experience.

API Caveat Explorer -- Surfacing Negative Usages from Practice: An API-oriented Interactive Exploratory Search System for Programmers

Application programming interface (API) documentation well describes an API and how to use it. However, official documentation does not describe "how not to use it" or the different kinds of errors when an API is used wrongly. Programming caveats are negative usages of an API. When these caveats are overlooked, errors may emerge, leading to heavy discussions on Q&A websites like Stack Overflow. In this demonstration, we present API Caveat Explorer, a search system to explore API caveats that are mined from large-scale unstructured discussions on Stack Overflow. API Caveat Explorer takes API-oriented queries such as "HashMap" and retrieves API caveats by text summarization techniques. API caveats are represented by sentences, which are context-independent, prominent, semantically diverse and non-redundant. The system provides a web-based interface that allows users to interactively explore the full picture of all discovered caveats of an API, and the details of each. The potential users of API Caveat Explorer are programmers and educators for learning and teaching APIs.

SmartTable: A Spreadsheet Program with Intelligent Assistance

We introduce SmartTable, an online spreadsheet application that is equipped with intelligent assistance capabilities. With a focus on relational tables, describing entities along with their attributes, we offer assistance in two flavors: (i) for populating the table with additional entities (rows) and (ii) for extending it with additional entity attributes (columns). We provide details of our implementation, which is also released as open source. The application is available at http://smarttable.cc.

SESSION: Demonstration Papers II

Interactive Symptom Elicitation for Diagnostic Information Retrieval

Medical information retrieval suffers from a dual problem: users struggle in describing what they are experiencing from a medical perspective and the search engine is struggling in retrieving the information exactly matching what users are experiencing. We demonstrate interactive symptom elicitation for diagnostic information retrieval. Interactive symptom elicitation builds a model from the user's initial description of the symptoms and interactively elicitates new information about symptoms by posing questions of related, but uncertain, symptoms for the user. As a result, the system interactively learns the estimates of symptoms while controlling the uncertainties related to the diagnostic process. The learned model is then used to rank the associated diagnoses that the user might be experiencing. Our preliminary experimental results show that interactive symptom elicitation can significantly improve user's capability to describe their symptoms, increase the confidence of the model, and enable effective diagnostic information retrieval.

Sover! Social Media Observer

The observation of social media provides an important complementing source of information about an unfolding event such as a crisis situation. For this purpose we have developed and demonstrate Sover!, a system to monitor real-time dynamic events via Twitter targeting the needs of aid organizations. At its core it builds upon an effective adaptive crawler, which combines two social media streams in a Bayesian inference framework and after each time-window updates the probabilities of whether given keywords are relevant for an event. Sover! also exposes the crawling functionality so a user can actively influence the evolving selection of keywords. The crawling activity feeds a rich dashboard, which enables the user to get a better understanding of a crisis situation as it unfolds in real-time.

Combining Terrier with Apache Spark to create Agile Experimental Information Retrieval Pipelines

Experimentation using IR systems has traditionally been a procedural and laborious process. Queries must be run on an index, with any parameters of the retrieval models suitably tuned. With the advent of learning-to-rank, such experimental processes (including the appropriate folding of queries to achieve cross-fold validation) have resulted in complicated experimental designs and hence scripting. At the same time, machine learning platforms such as Scikit Learn and Apache Spark have pioneered the notion of an experimental pipeline , which naturally allows a supervised classification experiment to be expressed a series of stages, which can be learned or transformed. In this demonstration, we detail Terrier-Spark, a recent adaptation to the Terrier Information Retrieval platform which permits it to be used within the experimental pipelines of Spark. We argue that this (1) provides an agile experimental platform for information retrieval, comparable to that enjoyed by other branches of data science; (2) aids research reproducibility in information retrieval by facilitating easily-distributable notebooks containing conducted experiments; and (3) facilitates the teaching of information retrieval experiments in educational environments.

Dynamic Composition of Question Answering Pipelines with FRANKENSTEIN

Question answering (QA) systems provide user-friendly interfaces for retrieving answers from structured and unstructured data given natural language questions. Several QA systems, as well as related components, have been contributed by the industry and research community in recent years. However, most of these efforts have been performed independently from each other and with different focuses, and their synergies in the scope of QA have not been addressed adequately. FRANKENSTEIN is a novel framework for developing QA systems over knowledge bases by integrating existing state-of-the-art QA components performing different tasks. It incorporates several reusable QA components, employs machine learning techniques to predict best performing components and QA pipelines for a given question, and generates static and dynamic executable QA pipelines. In this paper, we illustrate different functionalities of FRANKENSTEIN for performing independent QA component execution, QA component prediction, given an input question as well as the static and dynamic composition of different QA pipelines.

A System for Efficient High-Recall Retrieval

The goal of high-recall information retrieval (HRIR) is to find all or nearly all relevant documents for a search topic. In this paper, we present the design of our system that affords efficient high-recall retrieval. HRIR systems commonly rely on iterative relevance feedback. Our system uses a state-of-the-art implementation of continuous active learning (CAL), and is designed to allow other feedback systems to be attached with little work. Our system allows users to judge documents as fast as possible with no perceptible interface lag. We also support the integration of a search engine for users who would like to interactively search and judge documents. In addition to detailing the design of our system, we report on user feedback collected as part of a 50 participants user study. While we have found that users find the most relevant documents when we restrict user interaction, a majority of participants prefer having flexibility in user interaction. Our work has implications on how to build effective assessment systems and what features of the system are believed to be useful by users.

HyPlag: A Hybrid Approach to Academic Plagiarism Detection

Current plagiarism detection systems reliably find instances of copied and moderately altered text, but often fail to detect strong paraphrases, translations, and the reuse of non-textual content and ideas. To improve upon the detection capabilities for such concealed content reuse in academic publications, we make four contributions: i) We present the first plagiarism detection approach that combines the analysis of mathematical expressions, images, citations and text. ii) We describe the implementation of this hybrid detection approach in the research prototype HyPlag. iii) We present novel visualization and interaction concepts to aid users in reviewing content similarities identified by the hybrid detection approach. iv) We demonstrate the usefulness of the hybrid detection and result visualization approaches by using HyPlag to analyze a confirmed case of content reuse present in a retracted research publication.

RecAdvisor: Criteria-based Ph.D. Supervisor Recommendation

This demo presents RecAdvisor, a prototype recommender system for finding and recommending potential Ph.D. supervisors for students by identifying different criteria to consider when selecting a supervisor.

Minority Report by Lemur: Supporting Search Engine with Virtual Reality

In this paper, we introduce a Virtual Reality (VR) search engine interface. Virtual reality has been explored in the game industry and in the multimedia community. When wearing a VR device, a realistic experience is simulated around the user. In the working environment, VR's potential is still understudied. As a first step to enable VR-supported working environment, we present a search engine with a virtual reality interface. In our system, users can read, search and interact with the search engine with novel experiences. They only need to use their hands to interact with digital content, just like what is shown in the "minority report" movie.

Hide-n-Seek: An Intent-aware Privacy Protection Plugin for Personalized Web Search

We develop Hide-n-Seek, an intent-aware privacy protection plugin for personalized web search. In addition to users' genuine search queries, Hide-n-Seek submits k cover queries and corresponding clicks to an external search engine to disguise a user's search intent grounded and reinforced in a search session by mimicking the true query sequence. The cover queries are synthesized and randomly sampled from a topic hierarchy, where each node represents a coherent search topic estimated by both n-gram and neural language models constructed over crawled web documents. Hide-n-Seek also personalizes the returned search results by re-ranking them based on the genuine user profile developed and maintained on the client side. With a variety of graphical user interfaces, we present the topic-based query obfuscation mechanism to the end users for them to digest how their search privacy is protected.

SESSION: SIRIP: Industry Days

Machine Learning @ Amazon

In this talk, I will first provide an overview of key problem areas where we are applying Machine Learning techniques within Amazon such as product demand forecasting, product search, and information extraction from reviews, and associated technical challenges. I will then talk about two specific applications where we use a variety of methods to learn semantically rich representations of data: question answering where we use deep learning techniques and product size recommendations where we use probabilistic models. Rajeev Rastogi is a Director of Machine Learning at Amazon where he is developing ML platforms and applications for the e-commerce domain. Previously, he was Vice President of Yahoo! Labs Bangalore and the founding Director of the Bell Labs Research Center in Bangalore, India. Rajeev is an ACM Fellow and a Bell Labs Fellow. He is active in the fields of databases, data mining, and networking, and has served on the program committees of several conferences in these areas. He currently serves on the editorial boards of the CACM, VLDB Journal and ACM Computing Surveys, and has been an Associate editor for IEEE Transactions on Knowledge and Data Engineering in the past. He has published over 125 papers, and holds over 50 patents. Rajeev received his B. Tech degree from IIT Bombay, and a PhD degree in Computer Science from the University of Texas, Austin.

Extracting Real-Time Insights from Graphs and Social Streams

Many streaming applications in social networks, communication networks, and information networks are built on top of large graphs Such large networks contain continuously occurring processes, which lead to streams of edge interactions and posts. For example, the messages sent by participants on Facebook to one another can be viewed as content-rich interactions along edges. Such edge-centric streams are referred to as graph streams or social streams. The aggregate volume of these interactions can scale up super-linearly with the number of nodes in the network, which makes the problem more pressing for rapidly growing networks. These continuous streams may be mined for useful insights. In these cases, real-time analysis is crucial because of the time-sensitive nature of the interactions. However, generalizing conventional mining applications to such graphs turns out to be a challenge because of the expensive nature of graph mining algorithms. We discuss recent advances in several graph mining applications like clustering, classification, link prediction, event detection, and anomaly detection in real-time graph streams.

Big Data at Didi Chuxing

Didi Chuxing is the largest ride-sharing platform in China, providing transportation services for over 400 million users. Every day, Didi Chuxing's platform generates over 100 TB worth of data, processes more than 40 billion routing requests, and produces over 15 billion location points. In this talk, I will explain how Didi Chuxing applies big data and AI technologies to analyze big transportation data and improve the travel experience for millions of users.

The City Brain: Towards Real-Time Search for the Real-World

A city is an aggregate of a huge amount of heterogeneous data. However, extracting meaningful values from that data remains challenging. City Brain is an end-to-end system whose goal is to glean irreplaceable values from big city data, specifically from videos, with the assistance of rapidly evolving AI technologies and fast-growing computing capacity. From cognition to optimization, to decision-making, from search to prediction and ultimately, to intervention, City Brain improves the way we manage the city, as well as the way we live in it. In this talk, firstly we will introduce current practices of the City Brain platform in a few cities in China, including what we can do to achieve the goal and make it a reality. Then we will focus on visual search technologies and applications that we can apply on the city data. Last, a few video demos will be shown, followed by highlighting a few future directions of city computing.

Causal Inference over Longitudinal Data to Support Expectation Exploration

Many people use web search engines for expectation exploration: exploring what might happen if they take some action, or how they should expect some situation to evolve. While search engines have databases to provide structured answers to many questions, there is no database about the outcomes of actions or the evolution of situations. The information we need to answer such questions, however, is already being recorded. On social media, for example, hundreds of millions of people are publicly reporting about the actions they take and the situations they are in, and an increasing range of events and activities experienced in their lives over time. In this presentation, we show how causal inference methods can be applied to such individual-level, longitudinal records to generate answers for expectation exploration queries.

Merchandise Recommendation for Retail Events with Word Embedding Weighted Tf-idf and Dynamic Query Expansion

To recommend relevant merchandises for seasonal retail events, we rely on item retrieval from marketplace inventory. With feedback to expand query scope, we discuss keyword expansion candidate selection using word embedding similarity, and an enhanced tf-idf formula for expanded words in search ranking.

Product Question Answering Using Customer Generated Content - Research Challenges

Alexa is an intelligent personal assistant developed by Amazon, that can provide many services through voice interaction such as music playback, news, question-answering, and on-line shopping. The Alexa shopping research team in Amazon is a new emerging group of scientists who investigate revolutionary shopping experience through Alexa, while devising new search paradigms beyond traditional catalog search.

Auto-completion for Question Answering Systems at Bloomberg

The Bloomberg Terminal is the leading source of information and news in the finance industry. Through hundreds of functions that provide access to a vast wealth of structured and semi-structured data, the terminal is able to satisfy a wide range of information needs. Users can find what they need by constructing queries, plotting charts, creating alerts, and so on. Until recently, most queries to the terminal were constructed through dedicated GUIs. For instance, if users wanted to screen for technology companies that met certain criteria, they would specify the criteria by filling out a form via a sequence of interactions with GUI elements such as drop-down lists, checkboxes, radio and toggle buttons, etc. To facilitate information retrieval in the terminal, we are equipping it with the ability to understand and answer queries expressed in natural language. Our QA (question answering) systems map structurally complex questions like the above to a logical meaning representation which can then be translated to an executable query language (such as SQL or SPARQL). At that point we can execute the queries against a suitable back end, obtain the results, and present them to the users. Adding a natural-language interface to a data repository introduces usability challenges of its own, chief amongst them being this: How can the user know what the system can and cannot understand and answer (without needing to undergo extensive training)? We can unpack this question into two separate parts: 1) How can we convey the full range of the system's abilities? 2) How can we convey its limitations? We use auto-complete as a tool to help meet both challenges. Specifically, the first question pertains to the general issue of discoverability: We want at least some of the suggested completions to act as vehicles for discovering data and functionality of which users may have not been previously aware. The second question pertains to expectation management. Naturally, no QA system can attain perfect performance; limiting factors include representational shortcomings and various kinds of incompleteness of the underlying data sources, as well as NLP technology limitations. We want to stop generating completions as a signal indicating that we are not able to understand and/or answer what is being typed.

Talent Search and Recommendation Systems at LinkedIn: Practical Challenges and Lessons Learned

In this talk, we present the overall system design and architecture, the challenges encountered in practice, and the lessons learned from the production deployment of the talent search and recommendation systems at LinkedIn. By presenting our experiences of applying techniques at the intersection of recommender systems, information retrieval, machine learning, and statistical modeling in a large-scale industrial setting and highlighting the open problems, we hope to stimulate further research and collaborations within the SIGIR community.

The Evolution of Content Analysis for Personalized Recommendations at Twitter

We present a broad overview of personalized content recommendations at Twitter, discussing how our approach has evolved over the years, represented by several generations of systems. Historically, content analysis of Tweets has not been a priority, and instead engineering efforts have focused on graph-based recommendation techniques that exploit structural properties of the follow graph and engagement signals from users. These represent "low hanging fruits" that have enabled high-quality recommendations using simple algorithms. As deployed systems have grown in maturity and our understanding of the problem space has become more refined, we have begun to look for other opportunities to further improve recommendation quality. We overview recent investments in content analysis, particularly named-entity recognition techniques built around recurrent neural networks, and discuss how they integrate with existing graph-based capabilities to open up the design space of content recommendation algorithms.

Large Scale Search Engine Marketing (SEM) at Airbnb

Airbnb is an online marketplace which connects hosts and guests all over the world. Our inventory includes over 4.5 million listings, which enable the travel of over 300 million guests. The growth team at Airbnb is responsible for helping travelers find Airbnb, in part by participating in ad auctions on major search platforms such as Google and Bing. In this talk, we will describe how ad- vertising efficiently on these platforms requires solving several information retrieval and machine learning problems, including query understanding, click value estimation, and realtime pacing of our expenditure.

Clova: Services and Devices Powered by AI

NAVER, Korea's No. 1 internet company, and LINE, the Japan-based global messenger platform, have teamed up to use their extensive online content databases (documents, maps, encyclopedia, etc.) and user information to develop an AI platform called Clova (Cloud-based Virtual Assistant). Clova's deep-learning based methods have made tremendous progress in image classification and intelligent Q&A response. For example, in Naver, users can procure product or location information by showing pictures. Keyword-based matching is enhanced using semantic analysis, and retrieving top links is changed to answering questions by refining user queries through dialogue. In addition to PC and mobile, Clova is evolving to enable a user to access the relevant information using multiple devices and interfaces. Line released a smart speaker called Clova Friends that includes Line's free voice-call function and infrared home-appliance control. IPTV set-top box powered-by the Clova AI platform which can recommend movies based on Naver's user reviews. In this talk, I will cover some efforts and challenges in understanding and satisfying users on each device with sophisticated natural language processing technologies. I will introduce the technically challenging problems that we are currently tackling and future AI developments.

Lessons from Building a Large-scale Commercial IR-based Chatbot for an Emerging Market

In this work, we highlight some interesting challenges faced when trying to build a large-scale commercial IR-based chatbot, Ruuh, for an emerging market like India which has unique characteristics such as high linguistic and cultural diversity, large section of young population and the second largest mobile market in the world. We set out to build a "human-like" AI agent which aspires to become the trusted friend of every Indian youth. To meet this objective, we realised that we need to think beyond the utilitarian notion of merely generating "relevant" responses and enable the agent to comprehend and meet a wider range of user social needs, like expressing happiness when user's favourite team wins, sharing a cute comment on showing the pictures of the user's pet and so on. The agent should also be well-versed with the informal language of the urban Indian youth which often includes slang and code-mixing across two or more languages (English and their native language). Finally, in order to be their trusted friend, the agent has to communicate with respect without offending their sentiments and emotions. Some of the above objectives pose significant research challenges in the areas of NLP, IR and AI. We take the audience through our journey of how we tackled some of the above challenges while building a large-scale commercial IR-based conversational agent. Our attempts to solve some of the above challenges have also resulted in some interesting research contributions in the form of publications and patents in the above areas. Our chat-bot currently has more than 1M users who have engaged in more than 70M conversations.

LessonWare: Mining Student Notes to Provide Personalized Feedback

A new educational service has been prototyped by Echo360 that uses natural language processing to analyze students notes and provide personalized recommendations on how to both improve note-taking and scaffold learning. The LessonWare middleware system uses computer-generated transcriptions from class captures and available student notes to identify key terms mentioned during class sessions. The combination of analyzed key terms and corresponding timestamps allows contextual linkages to be created between educational resources. Student notes are automatically augmented with corresponding moments in class captures, specific pages in the course eTextbook or open education resources or specific adaptive learning assets.

SESSION: Tutorials

Deep Learning for Matching in Search and Recommendation

Matching is the key problem in both search and recommendation, that is to measure the relevance of a document to a query or the interest of a user on an item. Previously, machine learning methods have been exploited to address the problem, which learns a matching function from labeled data, also referred to as "learning to match''. In recent years, deep learning has been successfully applied to matching and significant progresses have been made. Deep semantic matching models for search and neural collaborative filtering models for recommendation are becoming the state-of-the-art technologies. The key to the success of the deep learning approach is its strong ability in learning of representations and generalization of matching patterns from raw data (e.g., queries, documents, users, and items, particularly in their raw forms). In this tutorial, we aim to give a comprehensive survey on recent progress in deep learning for matching in search and recommendation. Our tutorial is unique in that we try to give a unified view on search and recommendation. In this way, we expect researchers from the two fields can get deep understanding and accurate insight on the spaces, stimulate more ideas and discussions, and promote developments of technologies. The tutorial mainly consists of three parts. Firstly, we introduce the general problem of matching, which is fundamental in both search and recommendation. Secondly, we explain how traditional machine learning techniques are utilized to address the matching problem in search and recommendation. Lastly, we elaborate how deep learning can be effectively used to solve the matching problems in both tasks.

Conducting Laboratory Experiments Properly with Statistical Tools: An Easy Hands-on Tutorial

This hands-on half-day tutorial consists of two 90-minute sessions. Part I covers the following topics: paired and two-sample t -tests, confidence intervals (with Excel and R); familywise error rate, multiple comparison procedures; ANOVA (with Excel and R); Tukey's HSD test, simultaneous confidence intervals (with R). Part II covers the following topics: randomised Tukey HSD test (with Discpower); what's wrong with statistical significance tests?; effect sizes, statistical power; topic set size design (with Excel); power analysis (with R); summary: how to report your results. Participants should have some prior knowledge about the very basics of statistical significance testing and are strongly encouraged to bring a laptop with R already installed. The tutorial participants will be able to design and conduct statistical significance tests for comparing the mean effectiveness scores of two or more systems appropriately, and to report on the test results in an informative manner.

Neural Approaches to Conversational AI

This tutorial surveys neural approaches to conversational AI that were developed in the last few years. We group conversational systems into three categories: (1) question answering agents, (2) task-oriented dialogue agents, and (3) social bots. For each category, we present a review of state-of-the-art neural approaches, draw the connection between neural approaches and traditional symbolic approaches, and discuss the progress we have made and challenges we are facing, using specific systems and models as case studies.

Generative Adversarial Nets for Information Retrieval: Fundamentals and Advances

Generative adversarial nets (GANs) have been widely studied during the recent development of deep learning and unsupervised learning. With an adversarial training mechanism, GAN manages to train a generative model to fit the underlying unknown real data distribution under the guidance of the discriminative model estimating whether a data instance is real or generated. Such a framework is originally proposed for fitting continuous data distribution such as images, thus it is not straightforward to be directly applied to information retrieval scenarios where the data is mostly discrete, such as IDs, text and graphs. In this tutorial, we focus on discussing the GAN techniques and the variants on discrete data fitting in various information retrieval scenarios. (i) We introduce the fundamentals of GAN framework and its theoretic properties; (ii) we carefully study the promising solutions to extend GAN onto discrete data generation; (iii) we introduce IRGAN, the fundamental GAN framework of fitting single ID data distribution and the direct application on information retrieval; (iv) we further discuss the task of sequential discrete data generation tasks, e.g., text generation, and the corresponding GAN solutions; (v) we present the most recent work on graph/network data fitting with node embedding techniques by GANs. Meanwhile, we also introduce the relevant open-source platforms such as IRGAN and Texygen to help audience conduct research experiments on GANs in information retrieval. Finally, we conclude this tutorial with a comprehensive summarization and a prospect of further research directions for GANs in information retrieval.

Information Discovery in E-commerce: Half-day SIGIR 2018 Tutorial

E-commerce (electronic commerce or EC) is the buying and selling of goods and services, or the transmitting of funds or data online. E-commerce platforms come in many kinds, with global players such as Amazon, Airbnb, Alibaba, eBay, JD.com and platforms targeting specific markets such as Bol.com and Booking.com. Information retrieval has a natural role to play in e-commerce, especially in connecting people to goods and services. Information discovery in e-commerce concerns different types of search (exploratory search vs. lookup tasks), recommender systems, and natural language processing in e-commerce portals. Recently, the explosive popularity of e-commerce sites has made research on information discovery in e-commerce more important and more popular. There is increased attention for e-commerce information discovery methods in the community as witnessed by an increase in publications and dedicated workshops in this space. Methods for information discovery in e-commerce largely focus on improving the performance of e-commerce search and recommender systems, on enriching and using knowledge graphs to support e-commerce, and on developing innovative question-answering and bot-based solutions that help to connect people to goods and services. Below we describe why we believe that the time is right for an introductory tutorial on information discovery in e-commerce, the objectives of the proposed tutorial, its relevance, as well as more practical details, such as the format, schedule and support materials.

Fusion in Information Retrieval: SIGIR 2018 Half-Day Tutorial

Fusion is an important and central concept in Information Retrieval. The goal of fusion methods is to merge different sources of information so as to address a retrieval task. For example, in the adhoc retrieval setting, fusion methods have been applied to merge multiple document lists retrieved for a query. The lists could be retrieved using different query representations, document representations, ranking functions and corpora. The goal of this half day, intermediate-level, tutorial is to provide a methodological view of the theoretical foundations of fusion approaches, the numerous fusion methods that have been devised and a variety of applications for which fusion techniques have been applied.

Utilizing Knowledge Graphs for Text-Centric Information Retrieval

The past decade has witnessed the emergence of several publicly available and proprietary knowledge graphs (KGs). The depth and breadth of content in these KGs made them not only rich sources of structured knowledge by themselves, but also valuable resources for search systems. A surge of recent developments in entity linking and entity retrieval methods gave rise to a new line of research that aims at utilizing KGs for text-centric retrieval applications. This tutorial is the first to summarize and disseminate the progress in this emerging area to industry practitioners and researchers.

SIGIR 2018 Tutorial on Health Search (HS2018): A Full-day from Consumers to Clinicians

The HS2018 tutorial will cover topics from an area of information retrieval (IR) with significant societal impact --- health search. Whether it is searching patient records, helping medical professionals find best-practice evidence, or helping the public locate reliable and readable health information online, health search is a challenging area for IR research with an actively growing community and many open problems. This tutorial will provide attendees with a full stack of knowledge on health search, from understanding users and their problems to practical, hands-on sessions on current tools and techniques, current campaigns and evaluation resources, as well as important open questions and future directions.

A Tutorial on Probabilistic Topic Models for Text Data Retrieval and Analysis

As text data continues to grow quickly, it is increasingly important to develop intelligent systems to help people manage and make use of vast amounts of text data ("big text data''). As a new family of effective general approaches to text data retrieval and analysis, probabilistic topic models---notably Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocations (LDA), and their many extensions---have been studied actively in the past decade with widespread applications. These topic models are powerful tools for extracting and analyzing latent topics contained in text data; they also provide a general and robust latent semantic representation of text data, thus improving many applications in information retrieval and text mining. Since they are general and robust, they can be applied to text data in any natural language and about any topics. This tutorial systematically reviews the major research progress in probabilistic topic models and discuss their applications in text retrieval and text mining. The tutorial provides (1) an in-depth explanation of the basic concepts, underlying principles, and the two basic topic models (i.e., PLSA and LDA) that have widespread applications, (2) an introduction to EM algorithms and Bayesian inference algorithms for topic models, (3) a hands-on exercise to allow the tutorial attendants to learn how to use the topic models implemented in the MeTA Open Source Toolkit and experiment with provided data sets, (4) a broad overview of all the major representative topic models that extend PLSA or LDA, and (5) a discussion of major challenges and future research directions.

Knowledge Extraction and Inference from Text: Shallow, Deep, and Everything in Between

Systems for structured knowledge extraction and inference have made giant strides in the last decade. Starting from shallow linguistic tagging and coarse-grained recognition of named entities at the resolution of people, places, organizations, and times, modern systems link billions of pages of unstructured text with knowledge graphs having hundreds of millions of entities belonging to tens of thousands of types, and related by tens of thousands of relations. Via deep learning, systems build continuous representations of words, entities, types, and relations, and use these to continually discover new facts to add to the knowledge graph, and support search systems that go far beyond page-level "ten blue links''. We will present a comprehensive catalog of the best practices in traditional and deep knowledge extraction, inference and search. We will trace the development of diverse families of techniques, explore their interrelationships, and point out various loose ends.

Efficient Query Processing Infrastructures: A half-day tutorial at SIGIR 2018

Typically, techniques that benefit effectiveness of information retrieval (IR) systems have a negative impact on efficiency. Yet, with the large scale of Web search engines, there is a need to deploy efficient query processing techniques to reduce the cost of the infrastructure required. This tutorial aims to provide a detailed overview of the infrastructure of an IR system devoted to the efficient yet effective processing of user queries. This tutorial guides the attendees through the main ideas, approaches and algorithms developed in the last 30 years in query processing. In particular, we illustrate, with detailed examples and simplified pseudo-code, the most important query processing strategies adopted in major search engines, with a particular focus on dynamic pruning techniques. Moreover, we present and discuss the state-of-the-art innovations in query processing, such as impact-sorted and blockmax indexes. We also describe how modern search engines exploit such algorithms with learning-to-rank (LtR) models to produce effective results, exploiting new approaches in LtR query processing. Finally, this tutorial introduces query efficiency predictors for dynamic pruning, and discusses their main applications to scheduling, routing, selective processing and parallelisation of query processing, as deployed by a major search engine.

SESSION: Workshops

SIGIR 2018 Workshop on eCommerce (ECOM18)

eCommerce Information Retrieval has received little attention in the academic literature, yet it is an essential component of some of the largest web sites (such as eBay, Amazon, Airbnb, Alibaba, Taobao, Target, Facebook, and others). SIGIR has for several years seen sponsorship from these kinds of organisations, who clearly value the importance of research into Information Retrieval. The purpose of this workshop is to bring together researchers and practitioners of eCommerce IR to discuss topics unique to it, to set a research agenda, and to examine how to build datasets for research into this fascinating topic. eCommerce IR is ripe for research and has a unique set of problems. For example, in eCommerce search there may be no hypertext links between documents (products); there is a click stream, but more importantly, there is often a buy stream. eCommerce problems are wide in scope and range from user interaction modalities (the kinds of search seen in when buying are different from those of web-page search (i.e. it is not clear how shopping and buying relate to the standard web-search interaction models)) through to dynamic updates of a rapidly changing collection on auction sites, and the experienceness of some products (such as Airbnb bookings). This workshop is a follow up to the "SIGIR 2017 workshop on eCommerce (ECOM17)", which was organized at SIGIR 2017, Tokyo. In the 2018 workshop, in addition to a data challenge, we will be following up on multiple aspects that were discussed in the 2017 workshop.

SIGIR 2018 Workshop on ExplainAble Recommendation and Search (EARS 2018)

Explainable recommendation and search attempt to develop models or methods that not only generate high-quality recommendation or search results, but also intuitive explanations of the results for users or system designers, which can help to improve the system transparency, persuasiveness, trustworthiness, and effectiveness, etc. This is even more important in personalized search and recommendation scenarios, where users would like to know why a particular product, web page, news report, or friend suggestion exists in his or her own search and recommendation lists. The motivation of the workshop is to promote the research and application of Explainable Recommendation and Search, under the background of Explainable AI in a more general sense. Early recommendation and search systems adopted intuitive yet easily explainable models to generate recommendation and search lists, such as user-based and item-based collaborative filtering for recommendation, which provide recommendations based on similar users or items, or TF-IDF based retrieval models for search, which provide document ranking lists according to word similarity between different documents. However, state-of-the-art recommendation and search models extensively rely on complex machine learning and latent representation models such as matrix factorization or even deep neural networks, and they work with various types of information sources such as ratings, text, images, audio or video signals. The complexity nature of state-of-the-art models make search and recommendation systems as blank-boxes for end users, and the lack of explainability weakens the persuasiveness and trustworthiness of the system for users, making explainable recommendation and search important research issues to the IR community. In a broader sense, researchers in the whole artificial intelligence community have also realized the importance of Explainable AI, which aims to address a wide range of AI explainability problems in deep learning, computer vision, automatic driving systems, and natural language processing tasks. As an important branch of AI research, this further highlights the importance and urgency for our IR/RecSys community to address the explainability issues of various recommendation and search systems.

Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2018)

The large scale of scholarly publications poses a challenge for scholars in information seeking and sensemaking. Information retrieval~(IR), bibliometric and natural language processing (NLP) techniques could enhance scholarly search, retrieval and user experience but are not yet widely used. To this purpose, we propose the third iteration of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL). The workshop is intended to stimulate IR, NLP researchers and Digital Library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, text mining and recommendation techniques that can advance the state-of-the-art in scholarly document understanding, analysis, and retrieval at scale. The BIRNDL workshop will incorporate multiple invited talks, paper sessions, a poster session and the 4th edition of the Computational Linguistics (CL) Scientific Summarization Shared Task.

DATA: SEARCH'18 - Searching Data on the Web

This half day workshop explores challenges in data search, with a particular focus on data on the web. We want to stimulate an interdisciplinary discussion around how to improve the description, discovery, ranking and presentation of structured and semi-structured data, across data formats and domain applications. We welcome contributions describing algorithms and systems, as well as frameworks and studies in human data interaction. The workshop aims to bring together communities interested in making the web of data more discoverable, easier to search and more user friendly.

The Second Workshop on Knowledge Graphs and Semantics for Text Retrieval, Analysis, and Understanding (KG4IR)

Semantic technologies such as controlled vocabularies, thesauri, and knowledge graphs have been used throughout the history of information retrieval for a variety of tasks. Recent advances in knowledge acquisition, alignment, and utilization have given rise to a body of new approaches for utilizing knowledge graphs in text retrieval tasks and it is therefore time to consolidate the community efforts and study how such technologies can be employed in information retrieval systems in the most effective way. It is also time to start and deepen the dialogue between researchers and practitioners in order to ensure that breakthroughs, technologies, and algorithms in this space are widely disseminated. The goal of this workshop is to bring together and grow a community of researchers and practitioners who are interested in using, aligning, and constructing knowledge graphs and similar semantic resources for information retrieval applications.

Computational Surprise in Information Retrieval

The concept of surprise is central to human learning and development. However, compared to accuracy, surprise has received little attention in the IR community, yet it is an essential component of the information seeking process. This workshop brings together researchers and practitioners of IR to discuss the topic of computational surprise, to set a research agenda, and to examine how to build datasets for research into this fascinating topic. The themes in this workshop include discussion of what can be learned from some well-known surprise models in other fields, such as Bayesian surprise; how to evaluate surprise based on user experience; and how computational surprise is related to the newly emerging areas, such as fake news detection, computational contradiction, clickbait detection, etc.

First International Workshop on Professional Search (ProfS2018)

Professional search is a problem area in which many facets of information retrieval are addressed, both system-related (e.g. distributed search) and user-related (e.g. complex information needs), and the interface between user and system (e.g. supporting exploratory search tasks). Professional search tasks have specific requirements, different from the requirements of generic web search. The aim of this workshop is to bring together researchers to work on the requirements and challenges of professional search from different angles. We will have an interactive workshop where researchers not only present their scientific results but also work together on the definition of future challenges and solutions with input from information professionals. The workshop will deliver a roadmap of research directions for the years to come.

Second International Workshop on Conversational Approaches to Information Retrieval (CAIR'18): Workshop at SIGIR 2018

The CAIR'18 workshop will bring together academic and industrial researchers to create a forum for research on conversational approaches to search and recommendation. A specific focus will be on techniques that support complex and multi-turn user-machine dialogues for information access and retrieval, and multi-modal interfaces for interacting with such systems.

SIGIR 2018 Workshop on Learning from Limited or Noisy Data for Information Retrieval

In recent years, machine learning approaches, and in particular deep neural networks, have yielded significant improvements on several natural language processing and computer vision tasks; however, such breakthroughs have not yet been observed in the area of information retrieval. Besides the complexity of IR tasks, such as understanding the user's information needs, a main reason is the lack of high-quality and/or large-scale training data for many IR tasks. This necessitates studying how to design and train machine learning algorithms where there is no large-scale or high-quality data in hand. Therefore, considering the quick progress in development of machine learning models, this is an ideal time for a workshop that especially focuses on learning in such an important and challenging setting for IR tasks. The goal of this workshop is to bring together researchers from industry---where data is plentiful but noisy---with researchers from academia---where data is sparse but clean to discuss solutions to these related problems.

SIGIR 2018 Workshop on Intelligent Transportation Informatics

We propose a half-day workshop at SIGIR 2018 for the professionals, researchers, and practitioners who are interested in mining and understanding big and heterogeneous data generated in transportation to improve the transportation system. We plan to have both paper presentations and invited talks.

SESSION: Doctoral Consortium

Exploring Potential Pathways to Address Bias and Ethics in IR

The interplay between human biases and the underlying data collection and algorithmic methods to present users with relevant information in information retrieval (IR) systems have undesirable side effects, such as filter bubbles, censorship and developing beliefs in false information. Previous work in the areas of interactive information retrieval, document classification, behavioral economics and user profiling provide the foundation for our research. Using existing knowledge about human bias and profile data, we propose leveraging this information to raise awareness to users about their behavior in the frame of IR systems and inferences made. It is our goal to understand to what extent user behavior is changed. Our position is that education and awareness are much better approaches to address the ethical and human rights concerns when compared to regulatory measures and non-transparent changes to IR algorithms. It is believed the approach outlined below has the potential to dampen the effects of filter bubbles, reduce consumption of misleading and potentially hateful content, to broaden perspectives and protect the fundamental human right to freedom of expression.

SmartTable: Equipping Spreadsheets with Intelligent AssistanceFunctionalities

Tables are one of those "universal tools'' that are practical and useful in many application scenarios. Tables can be used to collect and organize information from multiple sources and then turn that information into knowledge (and ultimately to support decision-making) by performing various operations, like sorting, filtering, and joins. Because of this, a large number of tables exist already out there on the Web, which represent a vast and rich source of structured information and could be utilized as resources. Recently, a growing body of work has begun to tap into utilizing the knowledge contained in tables. A wide and diverse range of tasks have been undertaken, including but not limited to (i) searching for tables[4], (ii) extracting knowledge from tables, and (iii) augmenting tables (e.g., with new columns and rows[1,3] ).

The objective of this research is to develop a set of components for a tool called SmartTable, which is aimed at assisting the user in completing a complex task by providing intelligent assistance for working with tables. Imagine the scenario that a user is working with a table, and has already entered some data in the table. We can provide recommendations for the empty table cells, search for similar tables that can serve as a blueprint, or even generate automatically the entire table that the user needs. The table-making task can thus be simplified into just a few button clicks. Motivated by the above scenario, we propose a set of novel tasks such as row and column heading population, table search, and table generation. The following specific research questions are addressed: ( RQ1 ) How to populate table rows and column heading labels? ( RQ2 ) How to find relevant tables given a keyword query? ( RQ3 ) How to find tables relevant to the table the user is currently working on? ( RQ4 ) How to generate an output table as response to a free text query?

For RQ1, the task of row population [1,3] relates to the task of entity set expansion, where a given set of entities is to be completed with additional entities. Row population focuses on populating entities in the "core column'' of a relational table. We develop a two-step pipeline for this task utilizing a table corpus and a knowledge base. In the first step, candidate entities sharing the same categories with seed entities or co-occurring in similar tables are selected. In the second step, they are ranked by a probabilistic model. Column population shares similarities with the problem of schema complement, where a seed table is to be extended with additional columns. For column population, we regard column headings from similar tables as candidates and rank them using a probabilistic model.

For RQ2 and RQ3, we address the problem of table search. This task is not only interesting on its own but is also being used as a fundamental building block in many other table-based information access scenarios, such as table completion or table mining. To search related tables, the query could be some keywords [2,4] or it can also be an existing (incomplete) table. Based on the query type, this task is divided into two sub-tasks, which are table retrieval for keyword query and query-by-table respectively.

For RQ4, we introduce and address the task of the on-the-fly table generation: given a query, generate a relational table that contains relevant entities (as rows) along with their key properties (as columns) [5]. In terms of the table elements in a relational table, this task boils downing to core column entity ranking, schema determination and value look-up. We propose a feature-based approach for entity ranking and schema determination, combing deep semantic features with task-specific signals. For value lookup, we combine information from existing tables and a knowledge base.

So far, we have proposed methods and evaluation resources for addressing the tasks of row/column population, table search, and table generation. Future research directions for this project include looking up table values, interacting with tables using natural language, and generating table embeddings.

Efficiency-Effectiveness Trade-Offs in Machine Learned Models for Information Retrieval

Case-Based Retrieval Using Document-Level Semantic Networks

We propose a research that aims at improving the effectiveness of case-based retrieval systems through the use of automatically created document-level semantic networks. The proposed research leverages the recent advancements in information extraction and relational learning to revisit and advance the core ideas of concept-centered hypertext models. The automatic extraction of semantic relations from documents --- and their centrality in the creation and exploitation of the documents' semantic networks --- represents our attempt to go one step further than previous approaches.

Utilizing Inter-Passage Similarities for Focused Retrieval

Our main goal is studying the merits of using inter-passage similarities for the task of focused retrieval; i.e., ranking passages in documents by their relevance to an information need expressed by a query. As an initial research direction we study the cluster hypothesis for passage (focused) retrieval. We propose a novel suite of cluster hypothesis tests that employ inter-passage similarities and demonstrate that the cluster hypothesis holds for passages. In addition, we present several future directions we intend to pursue.

Enhanced Contextual Recommendation using Social Media Data

Making recommendations to a user which are more refined, personalized and contextually appropriate, has become an important problem in a variety of domains. Contextual Point of Interest (POI) recommendation is of particular interest for travel and tourism, to suggest places to visit for a user traveling to a new city. In this paper, I propose an approach which frames contextual POI recommendation as a traditional document ranking problem, where I represent each POI as a document. A significant aspect of this research investigates the generation of an enriched document representation for a POI, by harvesting information such as descriptions, users' reviews, and ratings from the web. I propose a log-linear retrieval model for document ranking using Kernel Density Estimation (KDE) which will eventually rank POIs. Additionally, I propose to use social media, such as Twitter, to supplement the recommendation process. People's spontaneous and unfiltered opinions about a POI give valuable information about that POI which is different from other online reviews. This unique type of user data will be used to enrich the document representation for POIs, and it is possible to incorporate the effect of social media into my retrieval model. I follow the TREC Contextual Suggestion track setup for my evaluation methodology.

A Semantic Search Approach to Task-Completion Engines

Web search is a human experience of a very rapid and impactful evolution. It has become a key technology on which people rely daily for getting information about almost everything. This evolution of the search experience has also shaped the expectations of people about it. The user sees the search engine as a wise interpreter capable of understanding the intent and meaning behind a search query, realizing her current context, and responding to it directly and appropriately[1]. Search by meaning, or semantic search, rather than just literal matches, became possible by a large portion of IR research devoted to study semantically more meaningful representations of the information need expressed by the user query. Major commercial search engines have indeed responded to user expectations, capitalizing on query semantics, or query understanding. They introduced features which not only provide information directly but also engage the user to stay interacting in the search engine result page (SERP). Direct displays (weather, flight offers, exchange rates, etc.), rich vertical content (images, videos, news, etc.), and knowledge panels, are examples of this recent evolution trend into answer engines [10].

Our notion of semantics is inherently based on the one of structure. Given the large portion of web search queries looking for entities[11], entities and their properties---attributes, types, and relationships---are first-class citizens in our space of structured knowledge. Semantic search can then be seen as a rich toolbox. Multiple techniques recognize these essential knowledge units in queries, identify them uniquely in underlying knowledge repositories, and exploit them to address a particular aspect of query understanding[2]. Query recommendations are another remarkable approach, whose suggestion feedback points to provide hints for improving the articulation of the search query.

>This research focuses on utilizing techniques from semantic search in the next evolution stage of search engines, namely, the support for task completion. Search is usually performed with a specific goal underlying the query. This goal, in many cases, consists in a nontrivial task to be completed. Indeed, task-based search corresponds to a considerable portion of query volume[5]. Addressing task-based search can have a large impact on search behavior, yet the interaction processes behind performing a complex task are very far to be fully understood. [12] Current search engines allow to solve a small set of basic tasks, and most of the knowledge-intensive workload for supporting more complex tasks is on the user. The ultimate challenge is then to build useful systems "to achieve work task completion''. [13] Rather than modeling explicitly search tasks, we strive for extending and enhancing solid strategies of semantic search to help users achieve their tasks.

One component we focus on in this research is utilizing entity type information, to gain a better understanding of how entity type information can be exploited in entity retrieval. [7,9] The second component is concerned with understanding query intents. Specifically, understanding what entity-oriented queries ask for, and how they can be fulfilled. [8] The third component is about generating query suggestions to support task-based search [4,6]. The search goal, often complex and knowledge-intensive, may lead the user to issue multiple queries to eventually complete her underlying task. We envisage the capability of the three identified components to complement each other for supporting task completion.

Addressing News-Related Standing Information Needs

A user with a standing need for updates on current events uses a structured exploration process for finding and reviewing new documents, with the user comparing document information to her mental model. To avoid missing key changes on the topic, the user should see some documents on each of the subtopics available that day. This research includes a system and evaluation approach for this standing need use case.

Improving Systematic Review Creation With Information Retrieval

Systematic reviews, in particular medical systematic reviews, are time consuming and costly to produce but are of value for clinical decision making, policy, and regulations. The largest contributing factors to the time and monetary costs are the searching (including the formulation of queries) and screening processes. These initial processes involve researchers reading the abstracts of thousands and sometimes hundreds of thousands of research articles to determine if the retrieved articles should be included or excluded from the systematic review. This research explores automatic methodologies to reduce the workload relating to the searching and initial screening processes. The objective of this research is to use Information Retrieval techniques to improve the retrieval of literature for medical systematic reviews.

Design and Evaluation of Query Auto Completion Mechanisms

Better Textbooks with Human Language Technology

Almost every part of the world relies on textbooks as the primary medium of imparting education. The quality of education, thus, is correlated with the quality of textbooks. In general, the quality of content in the textbooks used in the less-developed countries of the world is not up to the mark [1]. In addition to their intended purpose of delivering information, textbooks also promote behaviours that adults wish to pass on to the next generation [7]. It is, thus, important to ensure that textbooks are helpful in effective learning and do not condone undesirable social mores. The task of evaluating textbooks against these parameters is not trivial: experts must go through the entire content manually. This exercise being not only laborious, but also expensive [3]. This thesis attempts to propose a feasible computational alternative to this process

The dataset that we are working with is comprised of school textbooks collected from different boards of education of two south-east Asian countries that are widely regarded as 'developing countries'[4]. In general, a digitized textbook throws a large number of computational problems that require ideas from a number of disciplines such as Natural Language Processing, Information Retrieval, Human Computer Interaction and Cognitive Science. In this thesis we focus on the following research questions.

Research Question 1 How can we automatically identify the text fragments that reflect gender bias from textbooks?

The prevalence of gender bias has been reported in textbooks from many parts of the world [5,6]. Computational efforts to contain the effects of such biases have been proposed [2], but the detection of presence of gender bias, and identifying text passages that exhibit such biases is an unexplored problem. We model the task as a binary classification problem, with one class being biased text, and the other being unbiased text. For classifying text fragments, we propose a boosted memory-based model to learn the hidden patterns of biases from a small amount of labelled data. We are currently working on the sub-problem of identifying the gender of named human entities in textbooks in the absence of clear linguistic markers of gender such as 'sister', 'himself', etc. Our methodology is based on leveraging contextual data in the form of phrase vectors, along with sentence structure.

Research Question 2 How can we enhance the learning experience of a student through an optimal ordering of concepts, and ensure the coverage of all necessary topics?

We propose to represent textbook sections in the form of concept graphs [8], and utilize the linked structure of Wikipedia to determine necessary and sufficient concepts whose inclusion ensures completeness of the graph, i.e., prerequisite concepts [3] are not left out, while minimizing redundancy. For example, on the event of the presence of outgoing edges from an excluded concept C 1 to a concept C 2 included in the textbook, we suggest that C 1 be also included to ensure a comprehensive coverage of the subject matter. We also propose assigning weights to these edges, denoting the amount of dependence a concept has on another, to facilitate our aim of finding an optimal ordering of the concepts.

Research Question 3 How can we automatically identify and accumulate resources of the same difficulty level as the original text from external repositories, to facilitate betterunderstanding of the content among students?

Students often refer to online resources to better understand textbook content. A major hurdle associated with this practice is the wide mismatch in difficulty levels of the textbook, and the hits returned by a search engine [1]. Our aim is to learn to distinguish between texts of different levels of difficulty by using textbooks as the training data. We propose to apply a transfer learning-based approach to use the parameters learned during training to identify suitable web content. At present, we are planning to work with school-level basic science textbooks.

CHEERS: CHeap & Engineered Evaluation of Retrieval Systems

In test collection based evaluation of retrieval effectiveness, many research investigated different directions for an economical and a semi-automatic evaluation of retrieval systems. Although several methods have been proposed and experimentally evaluated, their accuracy seems still limited. In this paper we present our proposal for a more engineered approach to information retrieval evaluation.