Tutorials | SIGIR 2018

Tutorials

SIGIR 2018 will feature 3 full-day tutorials and 8 half-day tutorials by distinguished researchers that span a diverse range of important topics in information retrieval.

Full day tutorials

Knowledge Extraction and Inference from Text: Shallow, Deep, and Everything in Between
Probabilistic Topic Models for Text Data Retrieval and Analysis
SIGIR 2018 Tutorial on Health Search (HS2018) — A Full-day from Consumers to Clinicians

Half day tutorials

Conducting Laboratory Experiments Properly with Statistical Tools: An Easy Hands-on Tutorial
Information Discovery in E-commerce
Deep Learning for Matching in Search and Recommendation
Generative Adversarial Nets for Information Retrieval: Fundamentals and Advances
Tutorial on Utilizing Knowledge Graphs for Text-centric Information Retrieval
Neural Approaches to Conversational AI
Efficient Query Processing Infrastructures
Fusion in Information Retrieval

Please note that SIGIR 2018 does not offer a registration option for workshop or tutorial-only participation.

Tutorial Program: Sunday, July 8

Morning:

1. Conducting Laboratory Experiments Properly with Statistical Tools: An Easy Hands-on Tutorial

Tetsuya Sakai
Length: Half-day

This hands-on half-day tutorial consists of two 90-minute sessions. Part I covers the following topics: paired and two-sample t-tests, confidence intervals (with Excel and R); familywise error rate, multiple comparison procedures; one-way ANOVA (with Excel and R); Tukey’s HSD test, simultaneous confidence intervals (with R). Part II covers the following topics: randomised Tukey HSD test (with Discpower); what’s wrong with statistical significance tests?; effect sizes, statistical power; topic set size design (with Excel); power analysis (with R); summary: how to report your results. Participants should have some prior knowledge about the very basics of statistical significance testing and are strongly encouraged to bring a laptop with R already installed. The tutorial participants will be able to design and conduct statistical significance tests for comparing the mean effectiveness scores of two or more systems appropriately, and to report on the test results in an informative manner.
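
As a rough illustration of the first topic in Part I, the paired t-test for two systems evaluated on the same topics can be sketched as follows. This is a minimal Python sketch and not part of the tutorial material, which uses Excel and R; the function name and the per-topic scores are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_x, scores_y):
    """Return (t, degrees_of_freedom) for a paired t-test.

    scores_x and scores_y hold per-topic effectiveness scores of two
    systems over the same topics, in the same order.
    """
    if len(scores_x) != len(scores_y):
        raise ValueError("both systems must be scored on the same topics")
    diffs = [x - y for x, y in zip(scores_x, scores_y)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two topics")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all per-topic differences are identical")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical per-topic scores for two systems over four topics.
system_x = [0.5, 0.6, 0.7, 0.8]
system_y = [0.4, 0.4, 0.6, 0.6]
t, df = paired_t_statistic(system_x, system_y)
print(f"t = {t:.3f}, df = {df}")
```

The p-value would then be read from a t distribution with `df` degrees of freedom, for example with R's `t.test(x, y, paired = TRUE)`.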

2. Fusion in Information Retrieval

Oren Kurland and J. Shane Culpepper
Length: Half-day

Fusion is a classic technique used for more than twenty years in Information Retrieval, specifically ad hoc (query-based) retrieval, that allows multiple sources of information to be combined into a single result set. Fusion can be collection-based, system-based (multiple-ranking algorithms), content-based, and even query-based when many similar queries express the same information need. The real power of fusion comes from the fact that even simple aggregation functions have the potential to provide enhanced retrieval effectiveness by exploiting the chorus effect.

In this tutorial, we will show that advances in fusion are directly applicable to current open problems in the Information Retrieval community, and that much can be learned from these models as machine learning becomes even more prominent in modern search solutions. In particular we draw parallels between unsupervised fusion and ensembles of classifiers in supervised learning.

We focus on retrieval settings where a single corpus is used, and different factors that affect retrieval vary; e.g., queries used to represent the information need, document and/or query representations, ranking functions, etc. We briefly discuss the setting of retrieval over several corpora (a.k.a., federated or distributed search); specifically, we survey several state-of-the-art techniques for fusing lists retrieved from different corpora. We believe that federated search deserves a tutorial in its own right which covers the three main challenges: resource representation, resource selection and results merging.

Finally, it is important for everyone in the community to understand just how effective simple fusion techniques can be.
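
To give a sense of how simple such an aggregation function can be, here is a minimal Python sketch of CombSUM over min-max normalized scores. The choice of CombSUM, the normalization, the function names and the example runs are all assumptions made for illustration; the abstract does not say which methods the tutorial covers.

```python
from collections import defaultdict


def min_max_normalize(run):
    """Scale the scores of one run (doc id -> score) to the range [0, 1]."""
    if not run:
        return {}
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        # All scores equal: treat every document as equally strong.
        return {doc: 1.0 for doc in run}
    return {doc: (score - lo) / (hi - lo) for doc, score in run.items()}


def comb_sum(runs):
    """Fuse several runs by summing each document's normalized scores."""
    fused = defaultdict(float)
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] += score
    # Highest fused score first; ties are broken by document id.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical runs from two rankers with very different score scales.
run_a = {"d1": 12.0, "d2": 9.0, "d3": 3.0}
run_b = {"d2": 0.9, "d4": 0.8, "d1": 0.1}
fused = comb_sum([run_a, run_b])
for doc, score in fused:
    print(doc, round(score, 3))
```

In this toy example d2 ends up first although run_a ranks it second, because both runs score it highly. This is the chorus effect at work.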

3. Deep Learning for Matching in Search and Recommendation

Jun Xu, Xiangnan He and Hang Li
Length: Half-day

Matching is the key problem in both search and recommendation, and machine learning methods have been used to address it. In recent years, deep learning has been applied to the problem and significant progress has been made, including deep semantic matching models for search and neural collaborative filtering models for recommendation. The key to the success of the deep learning approach is its strong ability to learn representations and to generalize matching patterns from raw data (e.g., queries, documents, users, and items, particularly in their raw forms). In this tutorial, we aim to give a comprehensive survey of recent progress in deep learning for matching in search and recommendation. Our tutorial is unique in that we try to give a unified view of search and recommendation. By unifying the two tasks under the same view of matching and comparatively reviewing existing techniques, we can provide more insights into solving semantic matching problems.

This half-day tutorial targets PhD students, researchers, and industry practitioners interested in learning about semantic matching for search and recommendation or in advancing their current knowledge of it. Attendees are expected to have basic knowledge of search, recommendation, and machine learning. Those working on both search and recommendation can gain a deeper understanding of and more accurate insight into the two areas, which should stimulate ideas and discussion and promote the development of new technologies.

4. Generative Adversarial Nets for Information Retrieval: Fundamentals and Advances

Weinan Zhang
Length: Half-day

Generative adversarial nets, a.k.a. GANs, have been widely studied during the recent development of deep learning and unsupervised learning. Through an adversarial training mechanism, a GAN trains a generative model to fit the underlying unknown real data distribution under the guidance of a discriminative model that estimates whether a data instance is real or generated. The framework was originally proposed for fitting continuous data distributions such as images, so it cannot be applied directly to information retrieval scenarios where the data is mostly discrete, such as IDs, text and graphs. In this tutorial, we focus on GAN techniques and their variants for fitting discrete data in various information retrieval scenarios. (i) We introduce the fundamentals of the GAN framework and its theoretical properties; (ii) we carefully study the promising solutions for extending GANs to discrete data generation; (iii) we introduce IRGAN, the fundamental GAN framework for fitting single-ID data distributions, and its direct application to information retrieval; (iv) we further discuss sequential discrete data generation tasks, e.g., text generation, and the corresponding GAN solutions; (v) we present the most recent work on fitting graph/network data with node embedding techniques by GANs. In parallel, we also introduce relevant open-source platforms such as IRGAN and Texygen to help researchers conduct research on GANs in information retrieval. Finally, we conclude this tutorial with a comprehensive summary and an outlook on further research directions for GANs in information retrieval.

5. Knowledge Extraction and Inference from Text: Shallow, Deep, and Everything in Between

Soumen Chakrabarti
Length: Full-day

Systems for structured knowledge extraction and inference have made giant strides in the last decade. Starting from shallow linguistic tagging and coarse-grained recognition of named entities at the resolution of people, places, organizations, and times, modern systems link billions of pages of unstructured text with knowledge graphs having hundreds of millions of entities belonging to tens of thousands of types, and related by tens of thousands of relations. Via deep learning, systems build continuous representations of words, entities, types, and relations, and use these to continually discover new facts to add to the knowledge graph, and support search systems that go far beyond page-level “ten blue links”. We will present a comprehensive catalog of the best practices in traditional and deep knowledge extraction, inference and search. We will trace the development of diverse families of techniques, explore their interrelationships, and point out various loose ends.

6. Probabilistic Topic Models for Text Data Retrieval and Analysis

Chengxiang Zhai and Chase Geigle
Length: Full-day

Text data include all kinds of natural language text such as web pages, news articles, scientific literature, emails, enterprise documents, and social media posts. As text data continue to grow quickly, it is increasingly important to develop intelligent systems to help people manage and make use of vast amounts of text data (“big text data”). As a new family of effective general approaches to text data retrieval and analysis, probabilistic topic models, notably Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and many extensions of them, have been studied actively in the past decade with widespread applications. These topic models are powerful tools for extracting and analyzing latent topics contained in text data; they also provide a general and robust latent semantic representation of text data, thus improving many applications in information retrieval and text mining. Since they are general and robust, they can be applied to text data in any natural language and on any topic. This tutorial will systematically review the major research progress in probabilistic topic models and discuss their applications in text retrieval and text mining. The tutorial will provide (1) an in-depth explanation of the basic concepts, underlying principles, and the two basic topic models (i.e., PLSA and LDA) that have widespread applications, (2) an introduction to EM algorithms and Bayesian inference algorithms for topic models, (3) a hands-on exercise in which attendees learn how to use the topic models implemented in the MeTA Open Source Toolkit (https://meta-toolkit.org/) and experiment with provided data sets, (4) a broad overview of all the major representative topic models that extend PLSA or LDA, and (5) a discussion of major challenges and future research directions.
The tutorial should appeal to anyone who would like to learn about topic models, how and why they work, their widespread applications, and the remaining research challenges, especially graduate students, researchers who want to develop new topic models, and practitioners who want to apply topic models to their own application problems. Attendees are expected to have basic knowledge of probability and statistics.

7. SIGIR 2018 Tutorial on Health Search (HS2018) — A Full-day from Consumers to Clinicians

Guido Zuccon and Bevan Koopman
Length: Full-day

The HS2018 tutorial will cover topics from an area of information retrieval (IR) with significant societal impact — health search. Whether it is searching patient records, helping medical professionals find best-practice evidence, or helping the public locate reliable and readable health information online, health search is a challenging area for IR research with an actively growing community and many open problems. This tutorial will provide attendees with a full stack of knowledge on health search, from understanding users and their problems to practical, hands-on sessions on current tools and techniques, current campaigns and evaluation resources, as well as important open questions and future directions.

Afternoon:

1. Tutorial on Utilizing Knowledge Graphs for Text-centric Information Retrieval

Laura Dietz, Alexander Kotov and Edgar Meij
Length: Half-day

The past decade has witnessed the emergence of several publicly available and proprietary knowledge graphs (KGs). The depth and breadth of content in these KGs made them not only rich sources of structured knowledge by themselves, but also valuable resources for Web search systems. A surge of recent developments in entity linking and entity retrieval methods gave rise to a new line of research that aims at utilizing KGs for text-centric retrieval applications. This tutorial is the first to summarize and disseminate the progress in this emerging area to industry practitioners and researchers.

2. Neural Approaches to Conversational AI

Jianfeng Gao, Michel Galley and Lihong Li
Length: Half-day

This tutorial introduces neural approaches to conversational AI that were developed in the last few years. We group conversational systems into three categories: (1) question answering agents, (2) task-oriented dialogue agents, and (3) social bots. For each category, we present a review of state-of-the-art neural approaches, draw the connection between neural approaches and traditional symbolic approaches, and discuss the progress we have made and challenges we are facing, using specific systems and models as case studies.

3. Efficient Query Processing Infrastructures

Nicola Tonellotto and Craig Macdonald
Length: Half-day

Typically, techniques that benefit effectiveness of information retrieval (IR) systems have a negative impact on efficiency. Yet, with the large scale of Web search engines, there is a need to deploy efficient query processing techniques to reduce the cost of the infrastructure required. This tutorial aims to provide a detailed overview of the infrastructure of an IR system devoted to the efficient yet effective processing of user queries. This tutorial will guide the attendees through the main ideas, approaches and algorithms developed in the last 30 years in query processing. In particular, we will illustrate, with detailed examples and simplified pseudo-code, the most important query processing strategies adopted in major search engines, with a particular focus on dynamic pruning techniques. Moreover, we will present and discuss the state-of-the-art innovations in query processing, such as impact-sorted and blockmax indexes. We will also describe how modern search engines exploit such algorithms with learning-to-rank (LtR) models to produce effective results, exploiting new approaches in LtR query processing. Finally, this tutorial will introduce query efficiency predictors for dynamic pruning, and discuss their main applications to scheduling, routing, selective processing and parallelisation of query processing, as deployed by a major search engine.

4. Information Discovery in E-commerce

Zhaochun Ren, Xiangnan He, Dawei Yin and Maarten de Rijke
Length: Half-day

E-commerce (electronic commerce or EC) is the buying and selling of goods and services, or the transmitting of funds or data online. E-commerce platforms come in many kinds, with global players such as Amazon, Airbnb, Alibaba, eBay, JD.com and platforms targeting specific markets such as Bol.com and Booking.com.

Information retrieval has a natural role to play in e-commerce, especially in connecting people to goods and services. Information discovery in e-commerce concerns different types of search (exploratory search vs. lookup tasks), recommender systems, and natural language processing in e-commerce portals. Recently, the explosive popularity of e-commerce sites has made research on information discovery in e-commerce more important and more popular. The community is paying increased attention to e-commerce information discovery methods, as witnessed by an increase in publications and dedicated workshops in this space. Methods for information discovery in e-commerce largely focus on improving the performance of e-commerce relevance search and recommender systems, on enriching and using knowledge graphs to support e-commerce, and on developing innovative question-answering and bot-based solutions that help to connect people to goods and services.