CHAPTER 2

The selection of good search terms

Introduction This paper tackles the problem of how one might select further search terms, using relevance feedback, given the search terms in the query. The approach taken is based on an earlier paper by one of the authors [1] in which a theoretical model for the exploitation of statistical dependence between index terms was described. This model was evaluated in a later paper [2] showing the extent to which the use of statistical dependence information derived from co-occurring index terms would lead to an improvement in retrieval effectiveness. The experimental methodology used in the latter paper will also be adopted here, that is, the methods of evaluation and estimation will be the same except for minor variations which will be pointed out when appropriate. The general experimental set-up within which the ideas for the selection of search terms were worked out can be simply described. There are a number of test collections consisting of documents, and queries with associated relevance assessments. For each query the relevant documents are therefore known, so that a user's response to any output from a retrieval strategy may be simulated. (Sometimes the relevance assessments are not exhaustive. This happens when only a portion of the entire collection assumed to contain most of the relevant documents is scanned in determining the documents relevant to a query. Unassessed documents are then assumed to be non-relevant.) The basic relevance feedback strategy is one in which a simple strategy such as co-ordination level matching is used to retrieve an initial small set of (say 10 or 20) documents. The known relevant documents in this small set are then used to estimate certain parameters, which in turn are used to build up a new search function. This new search function will incorporate new search terms, not already occurring in the query, which are derived from a tree relating all the index terms in the entire collection. The tree structure is in the nature of a thesaurus although the links are statistically derived. It is this tree structure called the maximum spanning tree (or MST briefly) - see appendix - which is the main aid used for finding further search terms. The way in which the spanning tree is actually used during a retrieval run is not very different from earlier uses of term clustering to expand queries. The relationship between an MST and a commonly used clustering method, single-link, is explained in [4]. Thus it is not too difficult to interpret the MST for index terms as a term clustering. But it should be stressed that the tree structure does not imply a hierarchical relationship between the terms. In fact each of the connected terms are

-8-

informative about each other. The MST lies at the centre of the research described in this paper. It is this structure which captures the important dependencies between index terms. In the original model [1] for the use of statistical dependence between index terms, two such trees were envisaged, one for the terms in the relevant documents and one for the terms in the non-relevant documents. Subsequently this was approximated by assuming the same tree structure for both sets. In this paper we go one step further and, for the time being, abandon the attempt to construct probability distributions on both the relevant and non-relevant sets, until we can resolve the difficulties inherent in making an explicit assumption on both relevant and non-relevant sets of documents. Instead we simply try to expand queries by appropriate index terms in the MST, which itself is based on the distribution of cooccurrences in the entire collection. Preliminary experiments with this way of expanding queries were reported in our earlier paper [2]. To avoid confusion between the way the MST has been used in earlier work and the way it is used here we shall now briefly discuss the relationship between the MST and the underlying probability model. When faced with the problem of modelling the probability of relevance through distributional information about individual index terms one can make various assumptions about the independence or dependence of the index terms. A common one has been to assume that the index terms are independent on both the relevant and non-relevant sets of documents [3 3. In [1] a limited form of dependence on both these sets was assumed. To capture the important dependencies a spanning tree is constructed for each set. However, attempts to use this form of dependence model run into estimation and computational problems which remain to be solved. Instead we have sought a compromise solution, one which would enable us to use the dependence information and yet not have to estimate this from ridiculously small samples. In this paper we attempt to make use of the statistical dependence between index terms over the entire collection. Assuming dependence on the entire collection is consistent with assuming independence on both the relevant and non-relevant sets [4]. In [4] it was shown how this particular form of conditional independence can lead to sensible heuristics for expanding queries by index terms connected into an MST based on co-occurrence data derived from the entire collection. Our main concern in this paper is with the use of the MST connecting all index terms derived from distributional information about index terms in the entire collection. We use the spanning tree to expand the initial query. The effectiveness of this expansion is compared with that of the unexpanded query. The question naturally arises as to what is the !bestf spanning tree to use in this process. In [1] it was shown how the spanning tree might be constructed in an optimal way so that it produced the best approximation for the relevant probability functions. In that paper it was also conjectured that reasonable approximations to the optimal tree, although suboptimal, may yet give comparable performance in terms of retrieval effectiveness. That this is in fact so is demonstrated for several test collections.

-9-

The experimental results about the effectiveness of relevance feedback based on differently generated spanning trees are presented in the sequel. The feedback strategy evaluated is relatively simple, but constitutes a starting point for further research into more elaborate ways of using the spanning tree. The essential idea is to use a term contained in the query to lead us to further search terms. One can formulate the idea in terms of the Association Hypothesis [4]: If an index term is good at discriminating relevant from non-relevant documents then any closely associated index term is also likely to be good at this. When used without further qualification in expanding queries, one must obviously make the implicit assumption that query terms are good at discriminating relevant from non-relevant documents. Now although this is not always the case it is not unreasonable to assume it on the average. Thus we expand each query term by the terms that are closely associated with it, and those are to be found by looking at the adjacent terms in the MST, those which are connected to a query term. The union of the query and the sets of adjacent terms, one set for each query term, we called the expanded query in our previous paper [2]. Expanding the query terms by all the adjacent terms in the spanning tree is clearly not the only way of proceeding. One would like to be able to select in order of preference further search terms. This general theoretical problem of how one might choose search terms in order of decreasing discrimination power remains to be investigated. Before we can discuss our experimental results we must briefly describe the theoretical framework which has led to this work. Much of it may be found in Chapter 6 of Van Rijsbergen [4],

Basic Symbols The theoretical background to this work in IR derives from a straightforward application of Probability Theory including some simple use of statistical decision theory. To explain it we shall need to define a few symbols.

represents a document, where n is the number of index terms in the vocabulary and x^=0 when the ith term is absent and x^=1 when the ith term is present. We consider only two relevance categories, w*| : relevant v*2 i non-relevant The important probabilities we need to define are,

-10-

POO
is the probability of observing a document description _ within the x document collection irrespective of whether it is relevant or not. POcjw^ is the probability of observing ^ given that it represents a relevant (i=1) or non-relevant (i=2) document. P(wi!jc) is the probability that a document is relevant (i=1) or non-relevant (i=2) given its description £. P(w t ) is the prior probability of observing a relevant (i=1) or non-relevant (i=2) document. Obviously P(x) = P(jc|w1)P(w1) + P(_x!w2)P(w2)

Optimality The fundamental assumption made in all this work is that the distribution of descriptions on the relevant documents is different from the distribution of descriptions on the non-relevant documents and that the difference can be estimated and used to find relevant documents. The main quantity estimated for finding the relevant documents is P(w-j |_x) i.e. the probability of relevance for every document. The higher the probability the more likely we are to want to retrieve that document. (From now on documents will be identified with their descriptions unless the difference is important.) The simplest retrieval rule using these probabilities is given by the following, P(w-| |_x) > P(w2!_x) -> _ is relevant, x is non-relevant x . D1

This is a good rule for the following reason: it minimises the expected probability of misclassification (sometimes called the error rate). The probability of misclassification is given by P(error|_x) = P(w^!_x) if we decide w 2 P(w2!^() if we decide w<| So if for every document x we choose that w^ corresponding to the larger . of P(w-| Ijc) and P(w2!_x) then the choice will minimise P(error|_x) for each x. In doing so we will also minimise

-11-

\
P(error) = /_ P(error ijOPCjc)

which is the expected probability of misclassification. In C4] it is shown that this approach can be generalised to incorporate different costs associated with different errors. That is, we can associate a different cost with retrieving a non-relevant document from missing a relevant document. The retrieval rule will then be expressed in terms of expected cost, and will specify the choice leading to minimum expected cost. However, this generality is not required here: interested readers should consult [4], In practice the retrieval rule D1 is evaluated using Bayes1 Theorem: PUiw i )P(w i ) P(wi!_x) = POO To evaluate D1, P(w^|_x) is replaced by the R.H.S. of the equality in Th1. Furthermore, rather than estimate P C w ^ when evaluating the arguments of D1, we prefer to rank the documents by P(w-j |_x) thus obviating the need to set a cut-off. One can say more about the ranking than one would suspect at first sight. In fact there is an optimality principal now commonly known as the Probability Ranking Principle which states that ranking by P(w^ !_x) is optimal in the sense that at any fixed recall level the precision will be maximised [5]. A simple proof of this can be found in Harter [6]. Th1

Independence Let us look a little more closely at the retrieval rules that will result. Either way, whether we rank or use D1, we must estimate P(w-j !_x) by the R.H.S. of Th1. The usual assumption made is that the index terms are statistically independent. To say this just like that is actually ambiguous without precisely specifying the sets on which the independence holds,. So let us assume independence on both the relevant and non-relevant documents. Then we may write

-12-

POciw.,) = P(x1 iw-,) ... P(xniw.,) P(_x|w2) = P(x.,!w2) ... P(x n lw 2 ) Using these expressions we can derive the usual weighting functions [1] [2] [3], but let us first note that ranking w.r.t. P(w-| \x) is the same as ranking w.r.t. P(_x!w1)P(w1) log POciw 2 )P(w 2 ) This l a s t function i s g e n e r a l l y used to w r i t e down e x p l i c i t l y the weighting f u n c t i o n as f o l l o w s . Define p i = P ( x i = 1 iw.|) qi = P C x ^ l !w2) then

PU!W1) =

! i i=1

Xi Pi

(l-Pi)

"Xi

POcjw2) =

! ! i=1

Xi qi

(1-qi)

"

X i

Now substituting in the log function above we get n g(jc) = log ! !
1=1 1 Pi

n-qi)

x

X,

1-X,

P(w^ + log P(w 2 ) n \ = /_ c-x- + Const i=1 where

-13-

PiO-Qi) -i = log q l (1-p i ) This g(j<) is the linear weighting function and c^ are the weights that are estimated from the relevance information contained in a small set of documents retrieved by a simple strategy such as co-ordination level matching. For example if the set of documents retrieved is N and we display the relevance information in the usual contingency table, Relevant x i= 1 x^O R-i-i Non-relevant
n

i~ r i

n

i

N-n^R*^ N-R

N-nj N

we can derive the form of c^ used in [3], by using maximum likelihood estimates for p i and q ^ (i.e. p i = r^/R and qi = (n i -r i )/(N-R)) we . get,

r^CR-r^
>i = log (nj^-rj^J/CN-n^^-R+rj^) Although this form of c^ has proved successful, it causes difficulties when one or more of the interior cell values of the contingency table goes to zero, for then the log function is undefined. Robertson and Sparck Jones [3] sought to get around it by a well known statistical technique of adding .5 f s to the interior cells and adjusting the marginals accordingly. Unfortunately it does not solve the problem, for it grossly overestimates the probabilities involved. Of particular interest is the case when a term is not assigned to any relevant documents (r^=0). In the table below values for 100c^ are tabulated for a typical document collection: set N=1400, R=2, and the n^ and r^ values as shown in the table; adding .5!s to the interior cells and adjusting the marginals of the contingency table will result in the table as shown.

-14-

IOOCJ
n

i

ri=0

ri=1

r i= 2

25 50 75 100 125 150

238 168 125 95 71 51

403 331 288 257 233 212

568 494 450 419 394

In the absence of any confirming information of a term's ability to descriminate relevant from non-relevant documents, some large weights are computed. Compare the entries for ( r ^ O , n^=25) and (ri = 1f n i = 125). The reason for the large weight when r^=0 can be traced to overestimating the parameter p^. The f .5 technique1 is equivalent to estimating p^ with (r^+.5)/(R+1). To illustrate the extent of this overestimation we now also tabulate some p i values computed in this way when r i =0 and R ranges from 1 to 5.

R

1 2 3 4 5

.33 .20 .14 .11 .09

467 280 200 155 127

The final column headed n t shows values below which any value of n ^ will . lead to a positive value for the corresponding weight c^. These tables were in fact tabulated for the Cranfield 1400 collection for which the average n^ per query term is only 169. Therefore, frequently (sometimes large) positive values of c^ will be computed when a term is not assigned to any relevant documents. Hence the f .5 technique1 must be considered inadequate. For this reason we feel justified in proposing a different form of c i # In our previous paper [2] we suggested and evaluated a different form of c^ c i which we now realise will get ar around some of the estimation problems The weight we suggested in [2] was

-15-

J

iq

P(xifwq) d i q P ( x i f w q ) log Xi,w q P(xi)P(wq)

\

(The s u b s c r i p t q has been introduced to emphasise the f a c t t h a t our c a l c u l a t i o n s a r e always with r e s p e c t to one query q.) where
x^=0 for absence x^=1 for presence

and

w

w = W i for relevant = w 2 for non-relevant

and d^

is given conveniently by the following table:

Rewriting E^

as

\
G

iq = /- d iq D iq P iq x i' w q

then D i q =P(x i ,w ) i s t h e , d e g r e e °f involvement1 of cell (x i? w ) Lbution1 and P^ equal to the log part is called the 'probabilistic contri

It is easy to show that with D i =1 the weight G i q simplifies back to c^. But with D^q equal to the joint probability one does not need to 'adjust1 the cell values of the contingency table by .5 f s, since whenever an interior cell is zero we simply set 01og0=0 - which makes mathematical sense. We have no theoretical justification for this weight, but contrary to expectation, it outperforms the so-called independence weight c^. Obviously, since c^ is optimal under the independence assumptions, this superiority must be related to the difficulties associated with estimating c^ from small samples. It may well be that, in the light of G ^ f s . robustness and effectiveness, some theoretical justification will be found. It is this weight G^ which is used in the experiments reported in this paper.

-16-

Dependence and the role of the MST In deriving the linear weighting function g(_x) we made the assumptions of statistical independence on both w<| and W2. We could go on to say that these assumptions are unrealistic and therefore we would need to estimate the dependence between index terms to improve retrieval effectiveness. This is in fact what we set out to do in [1] and [2]. As we pointed out earlier, certain technical problems, mainly to do with estimation, prevented us from developing this approach further. However, a heuristic approach to using the maximum spanning tree has proved an interesting alternative to the strict term dependence model. For this we assume only that the terms are statistically dependent on the entire collection, and use the dependence to lead us from the query terms to further search terms. The maximum spanning tree capturing the important dependencies is generated in a similar way to that described in [1] [2] [4], Choosing an association measure from Table 1 we represent the pairwise association between index terms as a graph: the links measuring the association between any two terms represented by nodes. From this graph we can derive a maximum spanning tree, that is, a tree which spans the nodes in such a way as to have a maximum sum of links. So for each measure of association we will arrive at a different MST. Of these MST ! s one based on the expected mutual information measure (EMIM) is special. It is the basis for estimating in an optimal way the probability functions involved in our retrieval rule. The estimate is optimal in the sense that if we condition our variables in the way shown by the MST based on EMIM then we find a closer approximation to the underlying probability function than if we used any of the differently derived MST f s. Although this result is not of immediate concern, it provides the motivation for (a) selecting an MST in the first place as a useful object, and (b) preferring certain MST f s over others. In [1] it was conjectured that despite this optimality result, it may well be that a different, suboptimal but more efficiently generated, MST could give comparable retrieval performance to one generated from EMIM. In this paper we show this to be so. The optimality of the MST based on EMIM deserves some further comment. We are assuming that if one models the underlying probability functions as closely as possible then one will get the best possible retrieval. The Probability Ranking Principle guarantees this for the strict dependence model [1]; its optimality is a matter of statistical fit. As soon as one breaks away from the strict dependence model, the role of the MST changes and the optimality of the EMIM-based MST is no longer guaranteed. One can only conjecture whether or not the MST based on EMIM is still the best possible.

-17-

A different approach In this section we shall be more specific about the strategy used to expand queries with the aid of the MST and about its evaluation within a relevance feedback context. The general form of the weighting function is linear in the same way as the one derived from the independence assumption, but it is different in that the weights c^ are replaced by G^ . The basic feedback strategy is to retrieve a small set of documents, 10 or 20, by choosing the 10 or 20 documents best co-ordinated with the query. This set is then used to estimate the G^ weights for the query terms and the adjacent terms in the MST. The estimates required are for probabilities conditioned on either the relevant or non-relevant documents. In [2] we showed that the best way of doing this is to estimate PCIw-j) from the relevant documents in the feedback set and P(.iw2) from the entire set of documents minus the relevant documents in the feedback set. This is particularly important when estimating the probabilities for the probabilistic contribution in G| . The degree of involvement may be estimated in the same way although restricting the estimates to just the feedback set appears to be satisfactory [2]. Notice that whereas in [2] we felt obliged to adjust all our estimates by .5's here we have omitted to do this since we now realise that it is unsatisfactory and unnecessary. This will lead to minor discrepancies between precision and recall figures in this paper and corresponding ones in [2]. Three test collections, Cranfield 1400, UKCIS I and UKCIS II are used to measure the retrieval effectiveness of the feedback strategy under different conditions. The details of these test collections are summarised in Table 2. The method of evaluation, in terms of precision and recall, is the same as that in the earlier paper [2]. It is necessary to remind the reader that in evaluating feedback strategies we have adopted a method of residual ranking. Briefly, the feedback documents (seen by the user) are removed from the collection and precision recall figures calculated for the search on the remaining documents.

Experimental results Our benchmark for the experiments is COORD(N) where N can be either 10 or 20, indicating the size of the feedback set. (The mnemonic f name f (N) is used only in the text. In the tables columns will be headed by 'name1, and the value of N will be given at the start of the table as (cutoff=N).) COORD(N) simply continues the co-ordination level match on the remainder of the document after the feedback documents have been removed. So there is no expansion and no feedback. All strategies employing feedback are shown to be superior to this benchmark. Our first minor result is to establish the adequacy of the G i weight. For this we compare its performance on all three test collections with the independence weight c^ under the same condition: using both feedback and

-18-

expansion. The expansion is done using the standard MST generated using EMIM. To see the difference the reader should consult Table 3 and compare the precision recall figures for EMIM(N) with those for IND(N). This clearly shows the superiority of using the GjL weights on all three test collections. The result is the more remarkable for the fact that if the terms in the query and those added through expansion are assumed to be independent on both the relevant and non-relevant sets then theoretically the reverse should be the case i.e. IND(N) should be superior to EMIM(N). The major result is a comparison of the feedback strategies, using linear weighting, expansion and the G. weight, where the expansion is done with MST f s derived from the associated measures listed in Table 1. With each MST is associated a mnemonic identifying the appropriate association measure. This mnemonic is also used to identify the precision recall figures associated with the corresponding feedback experiment. These figures have been collected in Tables 4, 5 and 6, one for each test collection. For example Table 5 shows a column headed Maron in the first half of the table, and these figures pertain to a feedback strategy on UKCIS I with 10 documents in the feedback set and the MST based on the association measure labelled Maron in Table 1. No attempt has been made to graph the precision recall figures since we are only attempting to establish a no-difference effect. The figures bear out this claim: spanning trees generated from reasonable association measures do not give appreciably different retrieval results. It is interesting to note however that the MST derived from the EMIM measure on the whole gives slightly better retrieval effectiveness than all others, with one notable exception. This is particularly pleasing, since within the context of the strict dependence model [1] this is predicted. The exception is for the Maron function on the Cranfield 1400. There does not appear to be an explanation for this exceptional result. To appreciate these experimental results more exactly we have done an analysis of the feedback sets involved and the number of queries actually entering the evaluation. Tables 7, 8 and 9 show the details. For example in Table 7 for the Cranfield 1400 collection when the cut-off is 10, only 158 queries out of 225 enter the residual ranking evaluation because for the initial co-ordination level search 49 queries do not have any relevant documents in the feedback set and 18 queries have all their relevant documents in the feedback set. What to do about the queries left out of the evaluation is a difficult question to which we do not have an easy answer. The distribution of relevant documents in the feedback sets is also interesting, it shows that feedback is mostly based on only a few relevant documents. The alternative evaluation in Tables 10, 11 and 12, comparing a typical feedback experiment EMIMO0) with C00RDO0), is to emphasise the shortcoming of the data, or perhaps the low level of effectiveness of any strategy. This is particularly important in the case of UKCIS. The tables show the number of relevant documents retrieved at different cut-off levels in the residual ranking (not to be confused with the cut-off for the feedback set), and the number of queries not retrieving any relevant documents at the same levels.

-19-

For example in the case of UKCIS I (Table 11), co-ordination level matching on the remaining documents, after removing the feedback set of 10 documents, the 62 queries entering the evaluation (see Table 8) in total retrieve only 120 relevant documents when the residual ranking is cut-off at 20. Also, at the same cut-off, 24 of these 62 queries retrieved no relevant documents at all. Compare that with the feedback strategy EMIM(IO) shown in the adjoining column and one gets some idea of the dramatic improvement achieved by feedback: the number of relevant documents retrieved is doubled whereas the number of queries not retrieving any relevant documents is almost halved. However, most discouragingly, the figure at the bottom of the column shows that at a cut-off greater than 200 we still have 1807 (C00RDO0)) and 1710 (EMIM(10)) relevant documents to retrieve. In other words a large number of relevant documents simply remain irretrievable whether one uses feedback or not. Probably the only way to capture these documents is through document clustering.

Concluding Remarks We have shown how an MST derived from the distribution of co-occurrences of index terms in a document collection may be used to expand a query. The MST may be constructed using any of a number of reasonable measures of association. Within the simple feedback strategy described the different MST f s lead to approximately the same retrieval effectiveness, although on the whole, the MST based on the expected mutual information measure performs marginally better than any of the others. The method of query expansion via the MST is admittedly only very crude, but it constitutes a first step in the direction of a more refined approach. Obviously a more selective mechanism for expanding queries is needed, but this can only be done by developing some appropriate theory for following different branches of the MST. In fact one could go further than this and attempt to construct a theory which would enable a decision to be made as to whether it is more profitable to look at a nearest neighbour of a relevant document from the feedback set, or whether it is more profitable to proceed to a new search term given by the MST. No doubt ultimately some structure will be discovered which will enable, at any stage of the search, the tradeoff between retrieving a nearest neighbour and selecting a closely associated search term to be evaluated. We think that one of the major stumbling blocks to further developments in 'probabilistic information retrieval1 is the lack of a comprehensive theory about the estimation, from small biased samples, of the probabilities involved. The statistical literature seems to offer little guidance on this point. We have made some ad hoc suggestions, which may well be justifiable in theoretical terms. Our experiments show that the G. weight is superior to the so-called independent weight c^. Unfortunately this result is not unequivocal. We have found that in some rare circumstances c^ gives better performance than G- and we do not understand the reason for this. We believe that some theoretical work on the estimation rules involved may

-20-

throw some light on this.

-21-

REFERENCES 1. C.J. VAN RIJSBERGEN, A theoretical basis for the use of co-occurrence data in information retrieval. J. Docum. 1977, 33, 106-119. 2. D.J. HARPER and C.J. VAN RIJSBERGEN, An evaluation of feedback in document retrieval using co-occurrence data. J. Docum. 1978, 34, 189-216. 3. S.E. ROBERTSON and K. SPARCK JONES, Relevance weighting of search terms. J. ASIS 1976, 27, 129-416. 4. C.J. VAN RIJSBERGEN, Information Retrieval, Second Edition, Butterworths, London, 1979. 5. S.E. ROBERTSON, The probability ranking principle in IR. 33, 294-304. J. Docum. 1977,

6. S.P. HARTER, A probabilistic approach to automatic keyword indexing, Ph.D. Thesis, University of Chicago, 1974. 7. V.K.M. WHITNEY, Minimal spanning tree, Algorithm 422. 15, 273-274. Commun. ACM. 1972,

8. R.C. PRIM, Shortest connection networks and some generalizations. Syst. Tech. J. 1957, 36, 1389-1401.

Bell

-22-

APPENDIX The MST f s used in these experiments were generated by a BCPL program which follows the algorithm given by Whitney [7] f which in turn is an encoding in FORTRAN of the 'classic1 algorithm of Prim [8]. To connect N terms in an MSTf the Whitney/Prim algorithm requires 0(N ) term comparisons. Each term comparison involves the computation of the number of documents in which the terms co-occur, and (in the case of EMIM) the computation of four logarithms. Since N will usually be several thousand (typically between 5000 and 10000) a crude encoding of the algorithm would lead to a program that was too slow to be of any utility. But by making use of a number of optimisation strategies, an MST program was devised which, although heavy on computing resources, is practicable. The most important optimisation strategy is worth describing. If t is a term in a document collection, we denote by D(t) the set of documents in which t occurs. If d is a document, we denote by T(d) the set of terms contained in d. The mapping d -> T(d) is given by the document file, and t -> D(t) by the inverted file. The set of those terms with which t co-occurs in at least one document, C(t), is given by C(t) = union ( T(d) i d in D(t) ) (i.e. the union of the sets T(d) for which d is in D(t)), which may be written C(t) = T(D(t)) If we make the assumption that links in the MST will not be between terms with zero co-occurrence, then we do not need to compare t with any terms outside the set C(t). The program therefore estimates the size of C(t) for each t, and if it is small compared with the total term size, computes C(t) and uses this as a list of terms for comparison with t. In the case of Cranfield 1400, for example, only 10 of a potential total number of 3597903 comparisons are made to construct an MST. Two further points need to be made. The first is that C(t) is computed by mapping t -> D(t) -> T(D(t)), and this involves easy access to the document file and inverted document file. In fact these are both held in core in our implementation, and this imposes a strict upper limit to the size of the collections for which the MST can be generated. The second point is that for certain similarity measures (EMIM is one of them) terms with co-occurrence zero can in principle be linked in the MST. Consequently the MST f s used in these experiments are not necessarily exact, although we believe that the disparity, if there is one, is very slight, and should not affect the experimental results.

-23-

Cosine

P(x i =1 f Xj=1) ^/(P(Xi=1) P(xj=D)

Dice

2 P(x i =1 t Xj=1) P(xi=1)+P(xj=1)

EMIM

\ P(x if Xj) /_ P(x i§ Xj) log X i f Xj P(xi)P(Xj) P(x i =1 l Xj=1) - P(x i =1) P(Xj=1) EMIM/Entropy

Maron Raj ski

T
Entropy - /_ P(x if Xj) log P(x if Xj)
X. X•

Table 1

The different association measures used to generate the MST's,

-24-

Cranfield 1400

UKCIS I

UKCIS II

no. of documents no. of terms no. of requests average no. of terms per document average number of relevant documents per request

1400 2683 225 29.9 7.2

11613 12000 142 6.8 28.6

15748 8882 152 6.4 43.8

Table 2

Collection details.

-25-

Cranfield 1400 \ p EMIM R \ 0 45.77 10 43.42 20 38.26 30 33.85 40 29.76 50 27.11 60 21.53 70 19.25 80 16.90 90 13.47 100 13.25 IND 37.53 35.59 31.91 28.42 25.91 23.93 17.90 15.07 12.91 10.90 10.64

Cutoff?10 UKCIS I EMIM 50.52 35.68 27.43 22.25 18.14 15.82 13.35 9.43 8.94 5.17 3.77 IND 41.61 23.89 18.62 15.73 12.58 11.04 9.01 6.16 5.71 2.81 1.55

UKCIS I I EMIM 57.21 38.43 26.79 21.04 14.68 12.18 8.53 5.65 3.81 1.47 1.37 IND 41.72 29.65 20.02 15.44 12.46 9.10 6.66 5.25 3.34 1.69 1.51

Cranfield 1400
\ p EMIM R \ 0 41.59 10 38.48 20 33.37 30 28.14 40 24.97 50 23.28 60 16.77 70 13.85 80 12.33 10.66 90 100 10.42 IND 33.27 30.60 27.75 22.58 20.93 19.28 13.25 10.12 8.88 7.66 7.38

Cutoff=20 UKCIS I
EMIM 43.82 31.55 24.82 20.94 16.73 14.76 9.49 7.39 6.75 5.19 4.05 IND 36.38 21.03 16.06 13.60 11.32 10.11 6.12 4.66 4.27 2.52 1.36

UKCIS II
EMIM 57.38 38.32 25.91 19.95 15.13 10.29 7.76 5.60 4.22 1.46 1.31 IND 36.79 26.92 17.53 13.79 10.81 6.80 5.06 4.12 2.94 1.55 1.40

Table 3

A comparison, in terms of precision and recall, of two relevance feedback strategies with query expansion via the MST based on the expected mutual information measure. EMIM uses the G- weight and IND uses the independent c ^ weight. Cutoff indicates the size of the . feedback set.

-26-

Cranfield 1400 (cutoff=10) \ p Cosine R \ 0 42.43 10 40.21 20 35.35 30 30.45 40 27.70 50 24.95 60 19.31 70 15.13 80 13.11 90 11.10 100 10.95 Dice 43.96 42.25 37.86 33.82 29.53 27.72 21.28 16.25 14.37 11.87 11.64 EMIM 45.77 43.42 38.26 33.85 29.76 27.11 21.53 19.25 16.90 13.47 13.25 Maron 47.82 45.58 39.09 34.83 31.39 29.27 22.59 19.43 16.57 13.29 13.03 Rajski 44.61 41.59 38.26 32.83 28.67 26.53 20.73 17.24 15.34 12.47 12.29 Co-ord 28.30 26.73 24.86 20.62 17.15 15.12 10.92 8.97 7.36 6.13 6.00

Cranfj .eld 1400 ( c u t o f f = 2 0 ) \ P Cosine R \ 0 38.94 10 37.13 20 31.79 30 26.97 40 24.11 21.84 50 60 15.20 70 12.23 11.41 80 90 9.85 9.61 100 Dice 38.94 36.64 31.55 26.02 23.31 22.03 16.35 11.99 10.84 9.32 9.08 EMIM 41.59 38.48 33.37 28.14 24.97 23.28 16.77 13.85 12.33 10.66 10.42 Maron 42.99 39.99 34.65 29.29 25.68 23.72 17.25 13.97 12.14 10.51 10.21 Raj ski 38.53 36.25 32.64 27.24 22.95 21.08 16.38 13.46 11.92 10.62 10.37 Co-ord 17.90 16.29 15.58 12.67 10.58 10.09 7.70 5.34 4.94 3.97 3.89

Table 4

A comparison of the effectiveness, in terms of precision and recall, of different MST's based on a range of association measures. Each experiment uses relevance feedback and expansion. Cutoff indicates the size of the feedback set.

-27-

UKCIS I (cutoff=10)
\ p Cosine R \ 0 49.73 10 33.09 25.61 20 30 22.67 40 18.55 50 15.93 13.40 60 70 8.71 8.06 80 90 5.87 100 4.47 Dice 47.75 32.61 24.90 21.94 17.85 15.18 12.79 8.60 7.10 4.97 3.57 EMIM 50.52 35.68 27.43 22.25 18.14 15.82 13.35 9.43 8.94 5.17 3.77 Maron 48.80 34.03 25.67 21.25 17.69 15.30 13.22 9.27 8.84 5.25 3.78 Raj ski 48.63 31.98 24.34 22.01 18.19 15.82 13.37 8.68 8.06 5.87 4.48 Co-ord 30.77 16.73 11.86 9.79 7.20 5.77 4.92 3.29 1.87 1.18 0.59

UKCIS I (cutoff=20)
\ p Cosine R \ 0 38.82 10 29.13 23.75 20 30 20.67 40 16.45 14.18 50 60 9.36 7.34 70 6.85 80 90 5.19 4.05 100 Dice 39.94 27.86 22.15 18.91 15.76 13.25 8.39 6.33 5.93 4.27 3.12 EMIM 43.82 31.55 24.82 20.94 16.73 14.76 9.49 7.39 6.75 5.19 4.05 Maron 43.37 30.02 23.94 19.85 16.28 14.09 9.58 7.39 6.79 5.28 4.05 Rajski 38.24 28.68 22.72 19.99 15.69 13.94 9.37 7.36 6.87 5.19 4.05 Co-ord 21.04 11.52 8.22 6.48 5.28 4.60 2.92 2.45 2.24 1.57 1.16

Table 5

As for Table 4 but with a different test collection.

-28-

UKCIS II (cutoff=10) \ p Cosine R \ 0 50.60 10 36.20 20 27.11 30 21.34 40 14.77 50 12.05 60 8.14 70 5.45 80 3.66 90 1.24 100 1.12 Dice 51.66 36.04 27.89 20.74 14.51 12.13 8.17 5.46 3.71 1.25 1.12 EMIM 57.21 38.43 26.79 21.04 14.68 12.18 8.53 5.65 3.81 1.47 1.37 Maron 55.62 35.85 25.07 19.73 14.62 11.79 8.31 5.14 3.37 1.15 1.08 Raj ski 52.56 35.69 27.43 20.40 14.96 12.20 8.48 5.43 3.67 1.24 1.12 Co-ord 28.29 17.53 13.30 11.14 7.29 5.62 4.33 3.74 1.81 0.80 0.66

UKCIS II (cutoff=20)

\ p Cosine
R \ 0

Dice 53.59 37.08 25.70 18.30 13.88 10.16 7.38 5.17 4.14 1.27 1.12

EMIM 57.38 38.32 25.91 19.95 15.13 10.29 7.76 5.60 4.22 1.46 1.31

Maron 56.12 36.81 24.31 18.93 14.59 9.97 7.15 4.83 3.80 1.16 1.05

Raj ski 52.65 36.83 25.79 17.95 13.84 9.89 7.11 5.16 3.99 1.25 1.12

Co-ord 21.11 13.26 8.06 6.39 5.37 3.67 2.76 2.35 1.38 0.80 0.69

10 20 30 40 50 60 70 80 90 100

51.45 35.63 25.02 18.27 13.65 9.65 7.13 5.18 4.06 1.25 1.10

Table 6

As for Table 4 but with a different test collection.

-29-

Feedback set for Cranfield 1400 (cutoff = 10) no. no. no. no. of of of of queries queries queries queries = 225 in evaluation = 158 with no relevant documents = 49 with all relevant documents = 18

Distribution

no. o f r e l s : no. o f q u e r i e s :

1 51

2 35

3 37

4 17

5 14

6 1

7 1

8 1

9 1 (total

158)

Feedback set for Cranfield 1400 (cutoff = 20) no. no. no. no. of of of of queries queries queries queries = 225 in evaluation = 164 with no relevant documents = 32 with all relevant documents = 29

Distribution

no. of r e l s :
no. of queries:

1
39

2
34

3
31

4
21

5
15

6
11

7
3

8
5

9
1

10 11
1 3 (total 164)

Table 7

A breakdown of the feedback sets. The distribution shows the number of queries that have a different number of relevant documents in the feedback set.

-30-

Feedback s e t for UKCIS I ( c u t o f f
no. no. no. no. of of of of queries queries queries queries

=10)

= 142 in evaluation = 62 with no relevant documents = 77 with all relevant documents = 3

Distribution

no. of rels: no. of queries:

1 35

2 10

3 1

1 6

5 3

6 3

8 2

9 2 (total 62)

Feedback set for UKCIS I (cutoff = 20) no. no. no. no. of of of of queries queries queries queries = 142 in evaluation = 72 with no relevant documents = 66 with all relevant documents = 4

Distribution

no. of r e l s : no. of q u e r i e s :

1 30

2 12

3

4 9

5 4

6 4

7 2

8 3

3

10 12 2 2

18 1 (total 72)

Table 8

As for Table 7 but with a different test collection.

-31-

Feedback s e t for UKCIS I I (cutoff = 10)
no. no. no. no. of of of of queries queries queries queries = 152 in evaluation = 80 with no relevant documents = 70 with all relevant documents = 2

Distribution

no. of r e l s :
no. of queries:

1
33

2
16

3
11

^
7

5
3

6
2

8
5

9
1

10
2 (total 80)

Feedback set for UKCIS II (cutoff = 20) no. no. no. no. of of of of queries queries queries queries = 152 in evaluation = 88 with no relevant documents = 62 with all relevant documents = 2

Distribution

no. of rels: no. of queries:

1 2 3 28 16 15

4 8

5 6

6 2

8 3

9 10 12 14 15 16 18 20 1 2 1 2 1 1 1 1 (total 88)

Table 9

As for Table 7 but with a different test collection.

-32-

Cranfield 1400 (cutoff = 10)

Co-ord Cutoff Rels Retr. Retr. None

l

EMIM Retr.

Rels Retr.

10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200
200+

148 234 303 351 393 428 462 489 512 526 543 561 578 593 603 611 627 652 663 679 249

74 44 30 25 24 20 18 15 14 14 13 12 11 10 10 10 10 7 6 5

250 337 406 460 496 533 568 598 616 635 658 671 688 703 714 720 735 746 753 758 170

41 26 18 14 12 12 11 10 10 8 7 7 6 6 6 6 6 6 6 6

Table 10

The residual ranking is cut-off at different values 10(10)200f and at each one 'Rels Retr.1 indicates the number of relevant documents retrieved at that rank position. fRetr. None' indicates the number of queries that have retrieved no relevant documents at that rank position. The rank position 200+ shows the number of relevant documents that remain to be retrieved at rank 200. Cutoff indicates the size of the feedback set.

-33-

UKCIS I (cutoff = 1 0 )

Co-ord Cutoff Rels Retr. Retr. None

EMIM Rels Retr. Retr.

10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200
200+

74 120 170 225 265 286 297 306 321 333 347 368 387 396 402 416 437 448 466 475
1807

33 24 21 19 15 12 12 12 11 10 10 10 8 8 8 8 8 8 8 8

156 240 313 366 391 413 439 454 466 484 494 500 508 516 520 529 541 544 558 572
1710

18 15 13 13 13 13 12 12 12 12 12 12 11 11 11 10 10 10 9 9

Table 11

As for Table 10 but with a different test collection.

-34-

UKCIS II (cutoff = 10)

Co-ord Cutoff Rels Retr. Retr. None

EMIM Rels Retr. Retr.

10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200
200+

101 173 233 282 332 369 406 446 479 506 531 548 567 586 622 639 671 695 717 732
4538

46 37 29 24 23 22 21 18 17 13 12 12 12 11 9 9 8 8 7 7

240 386 491 577 651 707 761 802 849 906 961
1005 1059 1098 1135 1171 1214 1250 1276 1295 3975

20 15 13 12 11 9 9 8 8 8 8 8 8 8 8 8 7 6 6 6

Table 12

As for Table 10 but with a different test collection.

-35-