IR Models: Foundations and Relationships
Bio:
Thomas Roelleke is an associate professor (senior lecturer) at Queen Mary University of London. His research expertise lies in IR Models & Theory and Probabilistic DB+IR. Currently, Thomas is a visiting scientist at Yahoo Research, Barcelona Media.
Thomas has presented IR model tutorials (at German summer schools), taught MSc modules on Foundations of IR, and published several IR model & theory papers.
Summary:
This tutorial will familiarise participants with the foundations of, and relationships between, IR models. The objective is to present and discuss the theoretical concepts in an illustrative and interactive way.
The pre-requisites to enjoy this tutorial are:
1. Interest in IR models and theory.
2. Beginner-to-intermediate knowledge of TF-IDF, LM, BM25, the probability of relevance, and information theory (KL divergence).
An IR researcher asked: What is the motivation to propose such a tutorial? SIGIR participants should know the models.
Reviewers commented: a tutorial on IR models is of core interest to SIGIR; the tutorial touches on some controversial issues; presenting theory is challenging.
Rather than trying to explain the motivation, here is a selection of statements made by researchers in talks, reviews, and discussions:
- It is still unclear why LM works; TF-IDF is intuitive, but LM is not.
- TF-IDF is heuristic, purely heuristic; LM has probabilistic semantics, performs better than TF-IDF, includes an IDF-ish component, therefore, TF-IDF is obsolete.
- TF-IDF is not a model.
- LM is full of holes and hacks.
- It has been shown that LM is proportional to the probability of relevance.
- Who uses a document-frequency-based term probability in LM? This is wrong!
- With respect to BM25, everything has been done and tested; there is no significant research topic to be explored.
The tutorial will help participants form their own positions on, and answers to, such questions and opinions around IR models.
The content of the tutorial is structured into two main parts:
Part I: Foundations of IR Models
Part II: Relationships between IR Models
An overall objective of the tutorial is to make the theory of IR models accessible
and interesting. After the tutorial, attendees will know the foundations and
relationships, and will develop their view regarding open research challenges
and controversial issues around IR models.
1 Foundations
In this part, we look step by step at the foundations, taking extra care to achieve a notation that is consistent across the models.
1.1 TF-IDF
TF-IDF is a model. We look at its probabilistic roots, relate TF-IDF to LM, and discuss whether LM provides a probabilistic interpretation of TF-IDF (Hiemstra:2000).
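To make the discussion concrete, here is a minimal sketch of plain TF-IDF scoring. The function and variable names are ours (not the tutorial's), and IDF is taken in its simplest form, idf(t) = log(N/df(t)):

```python
import math

def tf_idf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Score one document against a query with plain TF-IDF.

    tf is the raw within-document term frequency;
    idf(t) = log(N / df(t)).  Illustrative sketch only.
    """
    score = 0.0
    for t in query_terms:
        tf = doc_terms.count(t)
        df = doc_freq.get(t, 0)
        if tf == 0 or df == 0:
            continue  # term absent from document or collection
        score += tf * math.log(num_docs / df)
    return score
```

Variants differ in how TF and IDF are dampened and normalised; the sketch uses the rawest quantification on purpose.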
1.2 Probability of Relevance Framework (PRF)
On P(r|d,q), its decomposition, the odds O(r|d,q), and how all of this is related to the binary independence model and BM25, but not to TF-IDF and LM.
1.3 BM25
We discuss the three main ingredients:
1. The so-called RSJ weight: w = log [ P(t|r) P(not t|not r) / ( P(not t|r) P(t|not r) ) ]
2. The BM25 TF: tf_d / (tf_d + K_d); K_d = k_1 ( b ( dl/avgdl ) + (1-b) )
3. Separate document length normalisation component
This part of the tutorial draws heavily on the BM25 tutorials by Robertson and Zaragoza (SIGIR 2007 and 2008).
We also review the 2-Poisson-based motivation (Robertson/Walker:94) of the BM25 TF.
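As an illustration of the first two ingredients, here is a minimal sketch (our own code, not Robertson/Zaragoza's) of the RSJ weight, i.e. the log of the odds ratio, and the saturating BM25 TF component:

```python
import math

def rsj_weight(p_t_r, p_t_nr):
    """Robertson/Sparck-Jones weight: log of the odds ratio
    P(t|r) P(not t|not r) / ( P(not t|r) P(t|not r) ).
    p_t_r = P(t|r), p_t_nr = P(t|not r)."""
    return math.log((p_t_r * (1.0 - p_t_nr)) / ((1.0 - p_t_r) * p_t_nr))

def bm25_tf(tf, dl, avgdl, k1=1.2, b=0.75):
    """BM25 TF quantification tf / (tf + K) with
    K = k1 * ( b * dl/avgdl + (1 - b) ).
    Saturates: each additional occurrence contributes less."""
    K = k1 * (b * (dl / avgdl) + (1.0 - b))
    return tf / (tf + K)
```

The saturation is the point: the gain from tf=9 to tf=10 is much smaller than from tf=1 to tf=2, and b controls how strongly the document length dl (relative to avgdl) penalises long documents.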
1.4 Language Modelling (LM)
We look at the JM and Dirichlet mixtures and how they are related and motivated in general probability theory, and how this general concept transfers into the IR case.
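The relationship between the two mixtures can be sketched in a few lines of our own illustrative code: Dirichlet smoothing is a JM-style linear mixture whose weight lam = dl/(dl + mu) depends on the document length.

```python
def jm_prob(tf, dl, p_coll, lam=0.8):
    """Jelinek-Mercer: linear mixture of the document model tf/dl
    and the collection (background) model p_coll."""
    return lam * (tf / dl) + (1.0 - lam) * p_coll

def dirichlet_prob(tf, dl, p_coll, mu=2000.0):
    """Dirichlet smoothing: (tf + mu * p_coll) / (dl + mu).
    Equals a JM mixture with document-dependent weight
    lam = dl / (dl + mu)."""
    return (tf + mu * p_coll) / (dl + mu)
```

For any tf, dl, and p_coll, dirichlet_prob(tf, dl, p_coll, mu) coincides with jm_prob(tf, dl, p_coll, lam=dl/(dl+mu)); this is the general mixture concept carrying over to the IR case.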
1.5 The Vector Space Model
Since the VSM was formulated with a TF-IDF-based term weighting scheme, VSM and TF-IDF are viewed as more closely related than the VSM and other models.
Today, we view the VSM (and its generalisations) as a framework that can be used to express models (Wong/Yao:95, Rijsbergen:2004, Roelleke/etal:2006). Gianni Amati on the VSM: specific to the VSM is the Euclidean norm; this is important.
1.6 Divergence from Randomness (DFR)
Is DFR a model or a framework? What are its main components? How is DFR related to TF-IDF (and BM25)?
1.7 Selected Concepts of Probability and Information Theory
The total probability theorem and the linked independence assumption.
Entropy, conditional entropy, and KL-divergence.
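These information-theoretic quantities can be sketched directly (our own helper names, natural logarithms):

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i log p_i (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i); non-negative,
    and zero iff p equals q on the support of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

KL divergence in particular recurs later in the tutorial, e.g. in the discussion of query clarity and of divergence-based models.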
2 Relationships
2.1 Probabilistic Inference Networks (PINs)
P(q|d) = sum_t P(q|t) P(t|d). How is this formula related to IR models? We review the Turtle/Croft:90-92 work on PINs (and TF-IDF), and Metzler:2004 on PINs and LM.
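The formula itself is the total probability theorem applied over the term layer. A much simplified, single-layer sketch (our own function name and toy probabilities, not Turtle/Croft's network machinery):

```python
def pin_prob_q_given_d(p_q_given_t, p_t_given_d):
    """P(q|d) = sum_t P(q|t) P(t|d): total probability over the
    term layer of a (much simplified) inference network.
    Both arguments are dicts mapping terms to probabilities."""
    return sum(p_qt * p_t_given_d.get(t, 0.0)
               for t, p_qt in p_q_given_t.items())
```

Different choices for P(q|t) and P(t|d) are exactly where the relationships to TF-IDF and LM enter.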
2.2 Logical approach to IR and other frameworks
Rijsbergen:CJ:86, Wong/Yao:TOIS:95, Roelleke/etal:IPM:2006: Logic and matrices as frameworks to express IR models.
2.3 Relevance-based LM
How to exploit relevance data in LM? We look at the approach in Lavrenko/Croft:2001.
2.4 Event Spaces
What is the issue regarding event spaces? We look at Robertson:2005 and Luk:2008. Also, we discuss the estimation of the background probability in LM, which leads us to the Poisson bridge relating DF-based and TF-based probabilities.
2.5 Axiomatic approaches
Looking at different models, what are the common components and properties? We review the axiomatic approach of Fang/Zhai:SIGIR:2005.
2.6 Information Theory
We review concepts such as entropy, conditional entropy, joint entropy, cross entropy, KL-divergence, query clarity/difficulty and document difficulty/clarity.
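Query clarity, for instance, is commonly computed as the KL divergence between a query language model and the collection model. A minimal sketch, assuming both term distributions are given as dicts:

```python
import math

def query_clarity(p_t_query, p_t_coll):
    """Query clarity: KL divergence between the query language model
    and the collection model; larger values indicate a more
    specific (clearer) query."""
    return sum(p * math.log(p / p_t_coll[t])
               for t, p in p_t_query.items() if p > 0)
```

A query whose term distribution matches the collection's background distribution has clarity zero; a topically focused query diverges from the background and scores higher.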
Also, we look at IDF and theoretical arguments, Robertson:2005, and review the relationship between PRF and IDF.
For more information visit http://www.eecs.qmul.ac.uk/~thor/2012/SIGIR-IR-Models-Tutorial.html