Tutorial Program: Sunday, July 8


1. Conducting Laboratory Experiments Properly with Statistical Tools: An Easy Hands-on Tutorial

Tetsuya Sakai
Length: Half-day

This hands-on half-day tutorial consists of two 90-minute sessions. Part I covers the following topics: paired and two-sample t-tests, confidence intervals (with Excel and R); familywise error rate, multiple comparison procedures; one-way ANOVA (with Excel and R); Tukey’s HSD test, simultaneous confidence intervals (with R). Part II covers the following topics: randomised Tukey HSD test (with Discpower); what’s wrong with statistical significance tests?; effect sizes, statistical power; topic set size design (with Excel); power analysis (with R); summary: how to report your results. Participants should have some prior knowledge about the very basics of statistical significance testing and are strongly encouraged to bring a laptop with R already installed. The tutorial participants will be able to design and conduct statistical significance tests for comparing the mean effectiveness scores of two or more systems appropriately, and to report on the test results in an informative manner.
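As a rough illustration of the first topic in Part I, the paired t-test statistic can be sketched in a few lines. This is a minimal sketch in Python rather than the Excel or R workflow used in the tutorial. The function name `paired_t_statistic` and the per-topic scores are hypothetical, chosen only for this example. The p-value and confidence interval are left out because they require a quantile of the t distribution, which a statistics package would normally supply.

```python
import math
import statistics


def paired_t_statistic(scores_x, scores_y):
    """Return (t, degrees_of_freedom, mean_diff, effect_size) for paired scores.

    scores_x and scores_y hold per-topic effectiveness scores of two systems
    evaluated on the same topics, in the same order.
    """
    if len(scores_x) != len(scores_y):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(scores_x, scores_y)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two topics")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    effect_size = mean_diff / sd_diff  # standardised mean difference
    return t, n - 1, mean_diff, effect_size


# Hypothetical per-topic scores for two systems (illustration only).
system_x = [0.50, 0.60, 0.40, 0.70, 0.55]
system_y = [0.45, 0.50, 0.42, 0.60, 0.50]

t, df, mean_diff, effect_size = paired_t_statistic(system_x, system_y)
print(f"t({df}) = {t:.3f}, mean difference = {mean_diff:.3f}, "
      f"effect size = {effect_size:.3f}")
```

Reporting the mean difference and a standardised effect size next to the test statistic is in line with the tutorial's stated aim of reporting results informatively.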

2. Fusion in Information Retrieval

Oren Kurland and J. Shane Culpepper
Length: Half-day

Fusion is a classic technique used for more than twenty years in Information Retrieval, specifically ad hoc (query-based) retrieval, that allows multiple sources of information to be combined into a single result set. Fusion can be collection-based, system-based (multiple ranking algorithms), content-based, and even query-based when many similar queries express the same information need. The real power of fusion comes from the fact that even simple aggregation functions have the potential to provide enhanced retrieval effectiveness by exploiting the chorus effect.
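To make the idea of a simple aggregation function concrete, here is a minimal sketch of one well-known score-based fusion method, CombSUM, preceded by min-max normalisation of each list. The abstract does not name a particular method, so this choice is an assumption made for illustration. The function names, document IDs, and scores are all hypothetical.

```python
from collections import defaultdict


def min_max_normalize(run):
    """Scale one run's scores to [0, 1].

    A run is a dict mapping document ID to retrieval score. A run whose
    scores are all equal is mapped to 1.0 throughout (an arbitrary choice).
    """
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {doc: 1.0 for doc in run}
    return {doc: (score - lo) / (hi - lo) for doc, score in run.items()}


def comb_sum(runs):
    """Fuse several runs by summing each document's normalised scores."""
    fused = defaultdict(float)
    for run in runs:
        if not run:
            continue  # skip empty result lists
        for doc, score in min_max_normalize(run).items():
            fused[doc] += score
    # Sort by fused score (descending), breaking ties by document ID.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Two hypothetical result lists for the same query.
run_a = {"d1": 12.0, "d2": 9.0, "d3": 3.0}
run_b = {"d2": 0.9, "d4": 0.8, "d1": 0.1}

fused_ranking = comb_sum([run_a, run_b])
for doc, score in fused_ranking:
    print(doc, round(score, 3))
```

In this toy input, `d2` is scored highly by both lists and therefore ends up ahead of `d1`, which tops only one of them. That is the chorus effect in miniature: agreement between lists is treated as evidence of relevance.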

In this tutorial, we will show that advances in fusion are directly applicable to current open problems in the Information Retrieval community, and that much can be learned from these models as machine learning becomes even more prominent in modern search solutions. In particular we draw parallels between unsupervised fusion and ensembles of classifiers in supervised learning.

We focus on retrieval settings where a single corpus is used, and different factors that affect retrieval vary; e.g., queries used to represent the information need, document and/or query representations, ranking functions, etc. We briefly discuss the setting of retrieval over several corpora (a.k.a., federated or distributed search); specifically, we survey several state-of-the-art techniques for fusing lists retrieved from different corpora. We believe that federated search deserves a tutorial in its own right which covers the three main challenges: resource representation, resource selection and results merging.

Finally, it is important for everyone in the community to understand just how effective simple fusion techniques can be.

3. Deep Learning for Matching in Search and Recommendation

Jun Xu, Xiangnan He and Hang Li
Length: Half-day

Matching is the key problem in both search and recommendation, and machine learning methods have been used to address it. In recent years, deep learning has been applied to the problem and significant progress has been made, including deep semantic matching models for search and neural collaborative filtering models for recommendation. The key to the success of the deep learning approach is its strong ability to learn representations and to generalize matching patterns from raw data (e.g., queries, documents, users, and items, particularly in their raw forms). In this tutorial, we aim to give a comprehensive survey of recent progress in deep learning for matching in search and recommendation. Our tutorial is unique in that we try to give a unified view of search and recommendation. By unifying the two tasks under the same view of matching and comparatively reviewing existing techniques, we can provide more insights into solving semantic matching problems.

This half-day tutorial targets PhD students, researchers, and industry practitioners interested in learning or advancing their current knowledge of semantic matching for search and recommendation. Attendees are expected to have basic knowledge of search, recommendation, and machine learning. Those working on both search and recommendation can gain a deeper understanding of and more accurate insight into the two areas, which should stimulate more ideas and discussions and promote the development of technologies.

4. Generative Adversarial Nets for Information Retrieval: Fundamentals and Advances

Weinan Zhang
Length: Half-day

Generative adversarial nets, a.k.a. GANs, have been widely studied during the recent development of deep learning and unsupervised learning. Through an adversarial training mechanism, a GAN trains a generative model to fit the underlying unknown real data distribution under the guidance of a discriminative model that estimates whether a data instance is real or generated. The framework was originally proposed for fitting continuous data distributions such as images, so it is not straightforward to apply it directly to information retrieval scenarios where the data is mostly discrete, such as IDs, text and graphs. In this tutorial, we focus on GAN techniques and their variants for discrete data fitting in various information retrieval scenarios. (i) We introduce the fundamentals of the GAN framework and its theoretical properties; (ii) we carefully study the promising solutions for extending GANs to discrete data generation; (iii) we introduce IRGAN, the fundamental GAN framework for fitting a single ID data distribution, and its direct application to information retrieval; (iv) we further discuss sequential discrete data generation tasks, e.g., text generation, and the corresponding GAN solutions; (v) we present the most recent work on graph/network data fitting with node embedding techniques by GANs. In parallel, we also introduce relevant open-source platforms such as IRGAN and Texygen to help researchers conduct research on GANs in information retrieval. Finally, we conclude the tutorial with a comprehensive summary and an outlook on further research directions for GANs in information retrieval.

5. Knowledge Extraction and Inference from Text: Shallow, Deep, and Everything in Between

Soumen Chakrabarti
Length: Full-day

Systems for structured knowledge extraction and inference have made giant strides in the last decade. Starting from shallow linguistic tagging and coarse-grained recognition of named entities at the resolution of people, places, organizations, and times, modern systems link billions of pages of unstructured text with knowledge graphs having hundreds of millions of entities belonging to tens of thousands of types, and related by tens of thousands of relations. Via deep learning, systems build continuous representations of words, entities, types, and relations, and use these to continually discover new facts to add to the knowledge graph, and support search systems that go far beyond page-level “ten blue links”. We will present a comprehensive catalog of the best practices in traditional and deep knowledge extraction, inference and search. We will trace the development of diverse families of techniques, explore their interrelationships, and point out various loose ends.

6. Probabilistic Topic Models for Text Data Retrieval and Analysis

ChengXiang Zhai and Chase Geigle
Length: Full-day

Text data include all kinds of natural language text such as web pages, news articles, scientific literature, emails, enterprise documents, and social media posts. In contrast to non-textual data, which are usually generated by physical devices, text data are generated by humans and meant to be consumed by humans. Due to the rapid growth of text data, we can no longer digest all the relevant information in a timely manner. Thus there is a pressing need for developing intelligent software tools to help people manage and make use of vast amounts of text data (“big text data”) for various tasks, especially those involving complex decision-making. Logically, to harness big text data, we would need to first identify the text data relevant to a particular application problem (i.e., perform text data retrieval) and then analyze the identified relevant text data in more depth to extract any needed knowledge for a task (i.e., text data analysis).

Due to the difficulty of natural language understanding by computers, the approaches that work well for text retrieval and text analysis tend to be statistical approaches. In the past decade, a class of statistical approaches called probabilistic topic models, represented primarily by Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and their numerous extensions, has been studied actively with widespread applications. These topic models provide a general and robust latent semantic representation of text data, thus improving many applications in information retrieval and text mining. Since they are general and robust, they can be applied to text data in any natural language with any topic coverage.

Recent years have seen increasing use of topic models for solving information retrieval problems and developing techniques for new retrieval applications. Given that the PLSA paper tops the list of the ten most-cited papers in the SIGIR section of the ACM Digital Library, topic models will likely attract increasing attention in the SIGIR community. At SIGIR 2017, the first presenter of the proposed tutorial (ChengXiang Zhai) gave the first tutorial on this topic at SIGIR, which was very well attended. However, that tutorial was a half-day tutorial, which could only cover in detail the basic topic modeling techniques (PLSA/LDA and the EM algorithm), with the important topic of Bayesian inference algorithms for topic models left out. We now extend the tutorial to a full-day tutorial so as to further cover (1) algorithms for Bayesian inference of topic models, (2) a more thorough discussion of applications of topic models, and (3) a hands-on exercise that allows the tutorial attendees to learn how to use the topic models implemented in the MeTA Open Source Toolkit (https://meta-toolkit.org/) and experiment with provided data sets.

7. SIGIR 2018 Tutorial on Health Search (HS2018) — A Full-day from Consumers to Clinicians

Guido Zuccon and Bevan Koopman
Length: Full-day

The HS2018 tutorial will cover topics from an area of information retrieval (IR) with significant societal impact — health search. Whether it is searching patient records, helping medical professionals find best-practice evidence, or helping the public locate reliable and readable health information online, health search is a challenging area for IR research with an actively growing community and many open problems. This tutorial will provide attendees with a full stack of knowledge on health search, from understanding users and their problems to practical, hands-on sessions on current tools and techniques, current campaigns and evaluation resources, as well as important open questions and future directions.


1. Tutorial on Utilizing Knowledge Graphs for Text-centric Information Retrieval

Laura Dietz, Alexander Kotov and Edgar Meij
Length: Half-day

The past decade has witnessed the emergence of several publicly available and proprietary knowledge graphs (KGs). The depth and breadth of content in these KGs have made them not only rich sources of structured knowledge by themselves, but also valuable resources for Web search systems. A surge of recent developments in entity linking and entity retrieval methods has given rise to a new line of research that aims at utilizing KGs for text-centric retrieval applications. This tutorial is the first to summarize and disseminate the progress in this emerging area to industry practitioners and researchers.

2. Neural Approaches to Conversational AI

Jianfeng Gao, Michel Galley and Lihong Li
Length: Half-day

This tutorial introduces neural approaches to conversational AI that were developed in the last few years. We group conversational systems into three categories: (1) question answering agents, (2) task-oriented dialogue agents, and (3) social bots. For each category, we present a review of state-of-the-art neural approaches, draw the connection between neural approaches and traditional symbolic approaches, and discuss the progress we have made and challenges we are facing, using specific systems and models as case studies.

3. Efficient Query Processing Infrastructures

Nicola Tonellotto and Craig Macdonald
Length: Half-day

Typically, techniques that benefit the effectiveness of information retrieval (IR) systems have a negative impact on efficiency. Yet, with the large scale of Web search engines, there is a need to deploy efficient query processing techniques to reduce the cost of the infrastructure required. This tutorial aims to provide a detailed overview of the infrastructure of an IR system devoted to the efficient yet effective processing of user queries. It will guide attendees through the main ideas, approaches and algorithms developed in the last 30 years in query processing. In particular, we will illustrate, with detailed examples and simplified pseudo-code, the most important query processing strategies adopted in major search engines, with a particular focus on dynamic pruning techniques. Moreover, we will present and discuss state-of-the-art innovations in query processing, such as impact-sorted and block-max indexes. We will also describe how modern search engines combine such algorithms with learning-to-rank (LtR) models to produce effective results, exploiting new approaches in LtR query processing. Finally, this tutorial will introduce query efficiency predictors for dynamic pruning, and discuss their main applications to scheduling, routing, selective processing and parallelisation of query processing, as deployed by a major search engine.

4. Information Discovery in E-commerce

Zhaochun Ren, Xiangnan He, Dawei Yin and Maarten de Rijke
Length: Half-day

E-commerce (electronic commerce or EC) is the buying and selling of goods and services, or the transmitting of funds or data online. E-commerce platforms come in many kinds, with global players such as Amazon, Airbnb, Alibaba, eBay, and JD.com, and platforms targeting specific markets such as Bol.com and Booking.com.

Information retrieval has a natural role to play in e-commerce, especially in connecting people to goods and services. Information discovery in e-commerce concerns different types of search (exploratory search vs. lookup tasks), recommender systems, and natural language processing in e-commerce portals. Recently, the explosive popularity of e-commerce sites has made research on information discovery in e-commerce more important and more popular. There is increased attention to e-commerce information discovery methods in the community, as witnessed by an increase in publications and dedicated workshops in this space. Methods for information discovery in e-commerce largely focus on improving the performance of e-commerce relevance search and recommender systems, on enriching and using knowledge graphs to support e-commerce, and on developing innovative question-answering and bot-based solutions that help to connect people to goods and services.