Bl

Section B : Statistical methods of determining relevance assessment requirements

This section constitutes the main body of the Report,

It is devoted

primarily to the detailed presentation, in technical terms, of the statistical methods considered for determining relevance assessment requirements. presentation is intended to be comprehensive and self-contained. The

But since

it is recognised that some at least of these interested in the 'ideal1 collection may not be statistically trained, an attempt has been made to provide separate, brief, Noddy-style accounts of the arguments as prefaces to the full presentation of each main method*. One of the methods, the 'Pool1 This is

method, was initially put forward in the design study report. therefore examined first.

The other methods investigated, and in particular

the main alternative 'Squares' method, are then described, and the Pool and Squares methods are compared. These methods are designed to provide

assessments for future experiments comparing two indexing or searching strategies, and the section concludes with a discussion of approaches to assessment for multi-strategy comparisons. therefore as follows. Bl Preface: note on the work proposed in the grant application, and its actual conduct by H. Gilbert. B2 B3 B4 B5 B6 The Pool method. Other methods : dead ends. The Squares method. Comparison between the Pool and Squares methods. Multi-strategy comparisons. The structure of the section is

B

1

Preface: the re search propos ed_ _ a_nd _ i ts conduct

The application was for a small grant to support a trained statistician for three months to carry out the following pieces of work:

* There is thus a certain amount of repetition, but it was thought that this was acceptable in the interests of self-contained non-specialist and specialist presentations.

B2

a)

to approach the relevance assessment problem from first principles i) to

check the initial approach of the design study report (Sparck Jones and Bates (1977)) and either confirm it or present an alternative; and ii) to work out the consequent detailed figures for the numbers of assessments required in different circumstances. This was envisaged as analytic work based on the

assumptions originally made, and in particular on two strong assumptions, namely that at building time the searches could be guaranteed to draw out all the relevant documents for a request, and all the documents which would be retrieved in any sensible search. b) to develop the argument to cover the cases where these assumptions are

weakened, as they certainly should be, i.e. to allow for only most (some specified percentage) of the relevant or possible output documents being retrieved. This development could perhaps be pushed through analytically, or

could involve some fairly straightforward computer simulation using standard statistical packages. Again detailed figures should be derived.

In general, where appropriate, tests with available collections should be conducted to check the statistical arguments.

The work has been carried out by H. Gilbert, a graduate student from the University Statistical Laboratory, with advice from Dr. P. Altham of that Laboratory, and from Dr. S.E. Robertson and Dr. C.J. van Rijsbergen. In

general the conduct of the project has been as proposed: the main difference has been that a range of methods has been considered, even though the study of the initial method confirmed the arguments on which it is based.

The project work can be split into two parts.

Firstly, a detailed

examination was made of the method of the design study report, here referred to as the Pool method. This examination took the form of tightening the Secondly,

assumptions made and improving the accuracy of the figures produced.

alternative methods were considered in the hope of finding one which reduces the number of assessments required. Of these alternatives two are presented and

in detail, one of which was abandoned for reasons stated in chapter B3 the other, referred to as the Squares method, is presented as a viable alternative in chapter

B 4 . The search for a suitable method of obtaining

assessments involved a fairly extensive literature survey, so the discussion of approaches to assessment in this report is hopefully a comprehensive one.

B3

Any lengthy tables have been relegated to Appendix 1, while the computing which the work involved is discussed in Appendix 2.

Bl.l General requirement

The main criterion used when calculating the sample size required to be assessed is that it should be large enough to make a comparison between any two retrieval strategies meaningful.

Meaningful has a specialised interpretation here which is centred on the statistical concepts of significance level and power_.

Suppose that we wish to test the hypothesis that there is no difference in recall between strategies A and B for a set of requests against the alternative hypothesis that strategy A is better. the null hypothesis, by H . be made (i) H could be rejected when it is, in fact, true could be accepted when it is false. Denote the first hypothesis,

Then there are two possible mistakes which could

(ii) H

Clearly it is desirable in any statistical test to place an upper bound on the probabilities of these events occurring. The significance level of a

test refers to the upper bound on the probability of event (i) occurring. Equivalent to placing an upper bound on the probability of event (ii) occurring is placing a lower bound on the probability of rejecting H This probability is known as the power of the test. when it is false.

The calculations in the remainder of section B are based on a significance level of 0.05(or 0.01) and a power of 0.95.

In general this means that the probability of making an incorrect decision has been reduced to an acceptable level.

For example, it will be shown later (see Table 1) that if we have 300 requests and an average of 25 relevant or retrieved documents per request, then if we wish the probability of error(i) (and the probability of error(ii)) occurring to be less than 0.05, there must have been 15 documents of known (defined as A better than B) relevance status assessed and the number of successes/friust be greater than 167. To achieve the 15 known documents 60% of the post must have been assessed.

B4

B2

The Pool method

B.2.1 Non-statistical summary

The method discussed below is referred to for convenience as the Pool method. The argument for it is basically the same as that used in the

design study report, with some alterations and improvements.

One of the major changes from the original argument is that the strong assumptions concerning the output of possible future searches and the percentage of relevant documents contained in the pool are no longer necessary. The cost of discarding these assumptions is that instead of the

argument resulting in what percentage of the pool should be assessed per request, it results in saying how many documents should be assessed given that the pool size is N. That is, for each different size of pool a

separate calculation must be performed to reveal the number of documents to be assessed.

A rough, non-statistical outline of the argument is as follows: suppose that there are k requests in the request set and that we wish to compare strategies A and B for this set. Then these k requests are thought precision)

of as k trials and in each trial the recall (or alternatively value of strategies A and B are compared.

To compare two strategies only the Note that the

part of each output which has been assessed is considered.

recall value is the probability of a document being retrieved given that it is relevant and the precision value is the probability of a document being relevant given that it is retrieved.

Next it is noted on how many trials the recall (precision) value of A is better than the corresponding value of B. This number is then tested to

see if it is likely to have arisen just through the variation due to sampling or whether it represents a real difference between the strategies. latter is the case then A is said to be better than B. If the

Clearly, for the comparison between A and B (typically in future experiments using the 'ideal1 collection) to be valid on this basis, the assessed sample must be adequate. This assessment sample would be drawn from

B5

the pooled output of searches done when the collection was built, and in the design study report it was assumed that the pool would contain the output of any future strategies and all relevant documents for a request. The pool

could thus be quite large, but for increasingly large request sets it was found that progressively smaller samples would need assessment. The argument

to establish the percentage sample required for a given request set size was essentially as follows. It works backwards from the way, described above,

the assessments are to be used.

We assume that what we are trying to establish is that there is a significant difference between two probabilities (or two proportions), based on sample estimates of them. That is we wish to show that there is a

significant difference between recall, or alternatively precision, expressed as a probability, for two strategies A and B, given the results of applying A and B for a set of requests; the difference itself is to be analysed in terms of probabilities, respectively that strategy A is better than B (Prob(A>B)) and B better than A (Prob(B>A)). The sets of search results are regarded as

samples because, as we cannot have exhaustive assessments, we compare A and B for each request with respect to a collection subset consisting of documents of known relevance status. (Notice that this will ordinarily be a subset of the We therefore need to know

sets of documents actually retrieved by A and B.)

how large this evaluation sample must be for a valid comparison between A and B for any request and hence over all requests, and further to know what size of assessment sample of the pool is required to ensure that the outputs of any strategies A and B respectively will contain large enough valuation samples. Different sizes of evaluation and assessment sample will be required

for different sizes of request set and of germane document set, as described below. It must moreover be emphasised that while the formal argument for sample

size is the same for recall and for precision, the evaluation sample for recall must consist only of relevant documents, while for precision the documents may be either relevant or non-relevant. The expression "documents of known relevance

status" will be used to cover both cases where a specific distinction is not required, but the underlying difference between the two should be borne in mind.

For the purposes of discussion we regard an occurence of A>B as a success. To characterise the probability distribution of successes we use the normal approximation to the binomial distribution. As a significance test to be applied

B6

to our comparison between A and B we choos.e the sign test because it makes few a priori assumptions about the data. For the two strategies we order each request in terms of effectiveness, i.e. the effectiveness of a request under A > its effectiveness under B. Effectiveness here is either precision

The null hypothesis, H , is o that there is no difference between A and B for the set of requests, i.e. Prob(A>B) = Prob(B>A) = h. Since the test is based on the binomial distribution we can use the normal approximation in an entirely standard way to find the critical region for the test, that is, that value of the standardised normal variable which needs to be exceeded for H to be rejected o at 5% significance level. If k is the number of requests, and x the number of successes, then under H : Prob(A>B) = Prob(B>A) = h and we get o 2X " k * N(0,1) /k Using normal tables we then find 2x - k > 2

or recall, which are regarded as probabilities.

A
gives 5% significance. This means, for example, that if k = 100 (requests), we must have at least 60 successes, i.e. 60 requests where A>B for either recall or precision, whichever we are using. Proceeding as just described would be all that was necessary if there were no uncertainties about the probabilities being compared, that is no uncertainty about recall or precision for each request, because strategy output could be related to global assessment information. Unfortunately it cannot, as this information is lacking, and we are in fact trying to decide whether A>B or B>A on the basis of two samples, one for A and one for B. Because requests differ, the difference between A and B will vary across the request set. So even if there is a real difference between A and B, it will be obscured, and even more so if the evaluation samples for the requests are unreliable due to size, or to bias in the assessment samples from which they are derived. Clearly if the samples for the requests are infinite (very large), the difference between A and B over the set can be confidently established; but such sampling is exactly what is not feasible.

We therefore first assume that the probabilities we are trying to estimate, i.e. recall or precision, are constant over requests, that is that

B7

the number or relevant documents or of retrieved documents are the same for all requests. This is clearly artificial, but may be interpreted from a We can then calculate the

practical point of view as referring to averages.

minimum evaluation sample size for each (and hence every) request necessary for the sign test to show a significant difference. This in turn requires The bigger the real

an assumed real difference between the two strategies.

difference, the smaller the sample size needed to reflect it; however prudence suggests assuming a small real difference, which may be taken, conventionally, as 5%, i.e. p - p = 5%. For the calculation a standard sampling theorem

for differences, again allowing the use of the normal approximation to the binomial distribution, can be utilised. For a given real difference of 5%

and some given (evaluation) sample size n, the theorem gives us P(x >x ) , A B i.e. the probability of recall (or precision) for A being greater than that for B. Conversely, for given P(x >x ) we can derive the sample size n A B required to achieve P. Then as constancy across requests is being assumed,

we can obtain, for P, the expected number of requests with A>B; or conversely, if we take as our expected number that number required by the sign test, as described above, we can work backwards to determine the sample size needed to achieve this expected number of requests with A>B. Thus for the example

above with 300 requests, and an expected number of requests with A>B = 167, we would need an evaluation sample of 15 documents of known relevance status. However if we design our data for this expected proportion of successes, this is not sufficient because of the uncertainty introduced by the fact that we are sampling. That is, while our evaluation sample may in principle be

adequate to tell us whether we have the required number of successes, in practice we cannot rely on accurate enough data to obtain the required evaluation sample in our output more than half the time. If we provide

ourselves with a larger evaluation sample, supplying in principle more information than we need, we have a better chance of obtaining enough actual information in practice. In other words, referring back to the discussion

of the significance test, in designing for the expected number, we may in practice find that the number of A , s>B , s will fall below the critical value represented by this number 50% of the time. We would in fact like a higher

chance of significance in our results, i.e. a higher ck nee of rejecting the null hypothesis if it is indeed false (so p - p = 5% is true). This can A B only be done by increasing the probability of P(x >x ) , which means increasing

B8

the evaluation sample sizewhen p - p

Specifically we. want to ensure a 95% chance

— 5% that the number of A , s>B , s exceeds the critical value.

We thus ask for what value P(x >x ) will it be the case that there is a A B 95% chance of significance rather than only a 50% chance. We use the normal

approximation -to the binomial, as before, to obtain this value, and hence the required size of evaluation sample. Once we have the sample size we can use it in a straightforward way to calculate the percentage of the pool to be assessed, and so can draw a random assessment sample from each request pool as appropriate. This is done

by relating the number of specific documents with known relevance status of the evaluation sample to the presumed or known total number of documents associated with a request. For the computation of recall the latter is the

total of relevant documents, for precision the total of retrieved documents for any strategy. Table 0 gives an illustrative selection of the somewhat

crudely calculated figures of the design study report, to show how the argument works out in practice, throughout interpreting the constant figures of the statistical argument as averages. For example we see that for 5% significance

and 500 requests, given 9 known relevance status documents and a total of 50 relevant or retrieved documents per request, 18% of the pool needs assessment; for 300 requests and 15 known, 30% needs assessment. pool obtained for requests in building the 'ideal
1

Thus if the

collection contained

1000 documents, this would imply assessment of a random sample of 180 documents in the first case and 300 in the second.

In practice slightly different consequences ensue according to whether we are interested in recall or precision, because we are concerned with different document sets, the relevant and the retrieved, respectively, and these are normally different. If they were identical, as above, the same However as the former is

percentage of the pool would have to be assessed.

usually smaller this implies, for a given known relevant sample, a higher percentage assessment. Recall thus imposes more stringent requirements for

assessment than precision, and should so be used in considering collection building effort.

B9

As will be shown in the technical discussion, paragraph 2-2.2, the essential argument for the Pool method can be developed without the use of the two assumptions about the nature, but not size of the pool, but as a tradeoff a specific pool size must be given.

For reference the complete set of assumptions underlying the whole argument as originally presented in the design study report may be summarised as follows, 1 for future experiments comparing strategies A and B 1 2 3 4 5 6 we evaluate using recall and precision; recall and precision are probabilities estimated by proportions based on samples; we use the sign test for validating performance differences; a percentage difference, say of 5%, between A and B, in recall or precision, is indicated by Prob - Prob = 5%; a normal sampling distribution for difference of proportions; a normal approximation to the binomial distribution for the power of the sign test.

2 f or_ assessment data 1 2 3 4 all relevant documents are contained in the pool; the output of A, and of B, is contained in the pool; a sample from the pool is a random sample; a pool random sample is also a strategy output random sample.

3 for request data in evaluation and assessment 1 2 the requests are independent; the probability of finding strategy A better than strategy B is constant across requests.

The detailed analysis which follows confirms that the original argument for the Pool method was essentially correct, and that the detailed figures given in the design study report were relatively accurate. figures were derived, and are given in Table 1. More carefully worked

With respect to the assumptions

given above, those about the evaluation methodology were perforce retained. For the assessment data, the strong assumptions 2.1 and 2.2 could, as mentioned, be jettisoned. Overall, the really important assumptions are those about the

requests, 3.1 and 3.2, and these cannot be avoided.

BIO

B2.2 Statistical presentation

B2.2.1 Scrutiny of design study argument

The experimental design assumed was that each of the two strategies was applied to a set of requests. The sign test was then used. That is, suppose there are k requests: then let n. be the number of assessed docu•f-V»

ments relevant to the i
th

request, a. and b, be the number of documents
1 1

relevant to the i

request retrieved by strategies A ans B respectively be the number of documents retrieved by stra-

(i = l,2,.. f k), and let n , n tegies A and B respectively.

The recall and precision values for strategy A can then be estimated a. a, by the proportions — and — respectively. So A is considered to, be better n th n ' A a. b. than B for the i request (using the recall criterion) if — > — (that is, n, n l 1 a. > b.) . . . . I l Define the random variable X. (i = l,..,k) such that
l

C 1 if a. > b,

X

= [

x

1

i ( 0 otherwise. Then the sign test refers E X. to Bi(k,^). (1) The reason for this is that, if the statement H is assumed to be true o (H = that there is no difference between the strategies), then P(a > b ) o i I (= probability that a. > b.) - P(b. > a.) = * for i = l,..,k. a
1 1 1 1

From (1) the probability of any specific value of EX. occurring, given that H is true, can be obtained. If this probability is below the signifi-

cance level of the test (and is therefore unlikely to have occurred due to sampling variation), then H is rejected in favour of A being better than

•/

°
l

B if EX. is large, and in favour of B being better than A if EX. is small. 1 Having decided on using the sign test the next stage was to find out the minimum number of requests for which strategy A would have to be better than strategy B in order to reject H in favour of A being better than B

(at the 5% [or 1%] significance level).

The way that this was done in the design study was to use the normal

Bll

approximation to the binomial distribution, that is: as k - - , >
X

~

k p

— a, N(0,1)

(2)

•kp(l - pf
where x is the number of successes (that is, the number of requests on which A was better than B ) , and p is the probability of success (that is, P(a. > b.) for recall^ which is assumed to be constant over the request set.

An improvement which was made here was to introduce the correction for continuity. This correction is necessary since the normal distribution is for a continuous variable while the binomial distribution involves a discrete variable. Regarding the observed frequency x as occupying an interval, the lower limit of which is half a unit below the observed frequency while the upper limit is half a unit above the observed frequency, the correction consists of reducing by 0.5 the difference between the observed value of x and the expected value of x (kp).

So (2) is replaced by
( x i

°-

5 )

--^N(0,1)

. is true.) Clearly A would

Sk/4
(Note that p = h since we are assuming that H not be deemed better than B if x < hk, to be larger than hk and so we have z = (x - 0,5) - hk ,„ ,, — ^ N (0,1) /k/4 so it follows that x can be assumed

Using, the tables of the normal distribution (Cambridge Statistical Tables) it can be seen that H is rejected in favour of A being better than o B at the 5% significance level if
X

- °' 5 " ^ k > 1.96

.

yfk/l
That is x > Similarly H o 1.96/k + k + 1 .

would be rejected at the 1% level if 2.58/k + k + 1

x >

This calculation provides us with column three of Table 1 (see Appendix 1 ) .

B12

Now that an upper bound on the probability of mistakenly rejecting H has been established we turn to placing a lower bound on the probability

of correctly rejecting H , that is, on the power of the test. In this o particular case 0.95 is the lower bound which was chosen. That is r we now have to find the value of p (=p , say) such that

• P(X>
Define

^

V

1

^

V

P

)

,0.95, Vp.po

.

r1.96/k + k + 1= [ ] c 2 (the integral part); then, since x can only take integer values we require p such that o P(x > x / p) > 0.95, Vp > p c o That is, x - 0.5 - kp P(z > / p) > 0.95, Vp > p /kp(l - kp) °

x

where z ^ N(0,1).

This value can be found by experimenting with different values of p and using a set of tables of the standardised normal distribution. Hence the fourth column of Table 1 is obtained.

Column five lists the values of the sample size required to identify a difference between the two strategies with 95% power and 5% (1%) significance. To obtain this column we need to use the following sampling theorem which can be found in Hoel (1971), p. 135. Since the pool is being sampled, the observed values of the recall (precision) proportion for strategies A and B are random variables. So let p observed values based on n and n and p represent the

trials respectively, for a particular

request, from two binomial populations with probability p and p resA B pectively. Then p and p denote the true recall (precision) values for A B strategies A and B. Applying the central limit theorem we obtain: Theorem When the number of trials n and n are sufficiently large (say, >25) = p, - p_ *A *B

or £ and j are normally distributed with mean y^ > 5 „ *A *B ' ^P A - p (note that p - p tween strategies A and B) and variance
^ u '" P~ = p^ ^A *B ?
P

denoted the true difference in recall [precision] be-

AqA

P

BqB
n

n^ A

+

^ B

where q = 1 - p . In the recall case n = n = the number of documents ^A *A A B

B13

which are relevant to the particular request, while in the case of precision n and n an cision n and n are the number of documents retrieved by strategies A A B and B respectively.

In the argument of the design study report it was then tacitly assumed that n = n = n, say. This assumption clearly holds for recall

since the number of documents relevant to any particular request is independent of which strategy is searching for them (remembering that we are dealing with documents which have already been assessed). However in the case of precision the assumption is saying that the number of assessed documents retrieved by each strategy is the same, which is clearly not a realistic assumption.

The assumption can be avoided by defining n = min(n , n ). We now A B define a. b. (— - — ) - (PA " P j n n A B A B P q P B% y, A A /( + ) n n_ A B where a. and b. were defined on page BIO. Then z ^ N(0,1) (by the theorem) Then we must choose n such that P(a. > b.) > p
1 1 o

(see column four of Table

1) so that we have 95% power. Unless we assume a lower limit on the real difference in recall (precision), that is p - p , we are unable to do A B this, so assume that p - p > 0.05. ^A *B We enlarge here upon the argument for precision, since the argument for recall is the same (with n = n ). A B Now a. , x P(— b. 1 ^ / > 0) = P( z > -0.05

But we do not know the values of p

and p r so we must use the inequality A B obtained by simple calculus, that p a < h, P D q o ^ h. This inequality is A A B D

not too crude, since for h < p < h , pq only varies over the range (0.187, 0.250).

B14

T h e r e f o r e , if
P(z > -0.5/2n)

^ o ,
>
iP

(3)

then
p(Z

>

- 0 . 05

yA q A

P

RqR B

o

Using the tables of the normal distribution we can find the smallest value of n satisfying (3). If the recall criterion is being used, n is equivalent to the number of assessed relevant documents, and if we are considering precision then n is the number of assessed documents retrieved by a strategy.

If exhaustive relevance judgement was possible then the above argument would be sufficient, with n denoting the number of documents of known relevance status per request. That is, given the number of requests available, the above argument obtains the minimum number of relevant documents in the case of recall, or retrieved documents in the case of precision, which must be assessed so that the result holds at the 5% significance level. This calculation was carried out in detail and Table 2 and Figure 1 were constructed.

Table 2(a) indicates how many requests are required so that a result can be obtained at the 5% (1%) significance level with 95% power, given the minimum number of documents of known relevance status (minimised over the request set).

The reverse requirement, how many documents are required for a given number of requests, is shown in Table 2(b). For example, if there are 200 requests, then unless each request has at least 21 documents of known status a result cannot be obtained at the 5% significance level.

Note that these figures only concern those experiments where the sign test is used to analyse the results. An alternative to the sign test is presented in chapter B4.
P

AqA n +
A

P

BqB n_ .
R

In the theorem above it was stated that var (p - 6 ) = A B

McNemar (1947) points out that this only holds when the samples from which the proportions p and p are drawn from are independent. In this

B15

case however, since the two strategies search the same set of documents, there is a definite correlation between p and p . This correlation results A B in a decrease in the variance, which would mean that the values of n obtained in column five of Table 1 are slightly on the large size. B2.2.2 Improved sampling rationale: the modified Pool method

In the argument used in the design study report, after n is calculated, the percentage of the pool required to be assessed in order to have, say, n documents of known relevance status assessed is calculated. The way this was done was to say that, for recall, if one has K relevant

documents altogether in the set and it is required to have n assessed then, if it is assumed that all the relevant documents are in the pool, one should estimate 100 x — % of the pool. For the precision criterion K the parallel argument assumes that all the output of future searches is contained in the pool. However, when the specified proportion is assessed then very often n required documents would not be obtained in the sample, since n is only the number which one would expect to obtain. An alternative method which gives more chance of obtaining the required number of documents of known relevance status is the following. Putting it in terms of recall, suppose that we have observed N documents in the pool, K of which are relevant, and suppose that at least n of these K must be assessed. Then, if a simple random sample of size S is taken for assessment, the probability that at least n relevant documents are contained in this sample is min(S,K) _ _ I (K) ( _ K ) * ¥—H — (N) n where ( ) = n (N-n)!n! Therefore S can be chosen such that the probability of having assessed n relevant documents is at least 0.95. Table 3 lists the values of S for various values of N, K and n. In order to construct the table, a program was written in standard FORTRAN (see Appendix 2 ) . (Note that the level of confidence is a user-controlled parameter of the program. We illustrate the results of setting it to 95% but it could of course be set lower.)

(hypergeometric distribution)

B16

Note that this argument does not require any assumptions about the output of future strategies or whether or not all the relevant documents are in the pool; also, one can say with greater confidence that the required number of relevant documents have been assessed. For reasons given in Appendix 2, there is small error in each of the figures, but it is still safe to use them provided the lower bound of 0.95 is taken as approximate (0.93 - 0.97).

As an example, consider the case when there are 300 requests and the significance level is taken to be 5%. Then Table 1 states that 15 relevant documents are required to be assessed, and that if there are 25 relevant documents altogether then 60% of the pool should be assessed.

However, if there are 1000 documents in the pool and 600 of them are assessed, then Table 3 states that we can only be "95% confident" that 11 relevant documents have been assessed, and that at least 729 documents would have to be assessed to be equally confident of assessing 15 relevant documents.

B2.2.3 Accuracy of estimators

As was mentioned earlier, p true proportions p

and p

are observed estimates of the

and p . If these estimates are required to have a cer-

tain accuracy then this places another lower bound on the number of relevant documents to be assessed for recall or the number of documents retrieved for precision.

So suppose that n

is to be chosen to ensure that (4)

P( lpA - P A I > d) < a where d and a are small and

A Assume, for the moment, that — (the sampling fraction) is small enough A N that 1 - — can be approximated by 1, and that N is large enough that is close to 1. Then
P

z -

A ' AqA

P

A

% N ( 0 , ,1)

/p

n
A

B17

So n must be chosen such that A P( |z| > — ^ — ) < a

n

A

That is, d /p q > z a

n A where z. is the double-tailed a-point of N(0,1). That is, p(|z| > z ) = . and it can be found from the tables. So 2 p q z ^ ^A^A a n > A " .2

a,

The problem now is that p and q are unknown .but we know that p q A A A A has a maximum value of h, so taking 2 z . a n > 4d will certainly satisfy the accuracy requirements of (4).

If the sampling fraction is large enough that it has to be retained then n Np q var(p) = (1 - — ) N (N - l)n So to satisfy P( |p - p I > d) < a we need

PAV, lrPAqA „4-l (1 + H - ^ ^ - ID
A where v = (—) z a v N v

(5)

(5) is not directly applicable since p

is not known precisely. Again this

can be overcome by replacing p q by \ . These calculations were performed in detail and tabulated in Table r\4. Pi

Note that whenever we have used the inequality P A q A - ^ alternative forms of action could be 1) perform a pilot study to yield a preliminary estimate of p , or 2) earlier experimental work using strategy A may give an indication of the true value of p in this case. A

B18

B2.2.4 Weakening the pool assumptions

In paragraph B2.2.1 of the foregoing the Pool method was presented using the strong assumptions of the design study report to the effect that the pool contains all relevant and retrieved documents. In paragraph B2.2.2 it was shown that these assumptions could be abandoned at the cost of having some specific pool size, i.e. knowing the pool size and calculating the assessment sample accordingly. The effect of weakening the assumptions without abandoning them altogether was therefore investigated to see what this would imply for assessment. Table 5 gives the percentage of the pool required if only 90% of the relevant documents (or of the search output) is deemed to be in the pool. Weakening the assumptions presents no difficulties for the general structure of the argument, and does not have any very drastic effects on the numbers of documents requiring assessment, especially for larger numbers of relevant or retrieved documents. For example, for recall for 300 requests and 25 relevant documents, assuming only 90% coverage in the pool implies a 66.7% assessment sample rather than a 60% one. In general weakening the assumptions to the indicated extent means that the percentage of the pool to be assessed increases by up to 10 percent, which in practice would probably, not be too costly.

B2.2.5 Conclusion on the Pool method

The argument presented in the design study report is statistically correct given that the assumptions hold; but the way in which the percentage of the pool required for assessment is calculated can be improved on, as indicated. Further, the original assumption set can be reduced: specifically the assumptions about the content of the pool are not required. The important assumptions which are still required are therefore: 3.1 3.2 the requests are independent; P(a. > b.) is constant for i = l,..,k.

Assumption 3.2 that P(A > B) is constant across requests is clearly unlikely to be true, and if it were not then our calculations for column three of Table 1 would be invalidf although for the moment this assumption is sufficient to enable us to obtain an idea of the magnitude of the sample size required to be assessed.

Also, in the case when there are only 5 or 10 documents of known

B19

relevance status the sampling theorem cannot be used unless it is assumed that £>A and p g are normally distributed. How confidently this assumption can be made is a question which anyone using the sign test with a small number of documents of known status should consider carefully.

As assumption 3.2 is somewhat unrealistic, alternative methods not requiring it or other equally strong assumptions were sought. The investigation of alternatives was also stimulated by the desire to simply reduce the number of assessments required, in order to lower the cost of building the 'ideal1 collection. The lines of work undertaken are described in the following chapter.

B3

Other methods: dead ends

This section is concerned mainly with methods which were investigated which turned out turned out to be unsatisfactory. One of the main difficulties was the fact that most tests applicable are conditional but in this situation no data is available, since we require to know the sample

size to be assessed before the search strategies are applied.

The first approach was a rather lengthy literature survey to see if this problem(or an analogous one) had been considered elsewhere. This proved to be virtually fruitless. In particular the type of paper which was looked at was that dealing with 2x2 contingency tables, since these were thought to be the most informative way of tabulating data. A typical example of such a table for the present retrieval context would be the following.

retrieved by A retrieved by B not retrieved by B
a.
l L.

not retrieved by A
1

a.
1

c.
l

a.
l

+ c.
l

(1)
b.
1
»^i—-—_-_—,— i ", i i i "i

d. c.
! i

b.

l

+ d. n.

l

+ b.
l

l

+ d.

l

!

1

where a. is the number of relevant documents (for the i

request) re-

trieved by strategies A and B, b. is the number of relevant documents retrieved by strategy B but not by strategy A, etc. An analogous table

B20

could be constructed for non-relevant documents.

B3.1 Logistic model

One way of approaching the problem using this technique is to construct such-a table for each request and to be left with a series of 2x2 contingency tables. Cox (1970) analyses a series of 2x2 tables resulting from two independent samples. However, as was pointed out earlier, here the samples are not independent (since both strategies search the same document set), and so Cox's method could not be applied directly. To overcome this we made use of the logistic model which is outlined below.

Suppose there are k requests, and let n. denote the number of documents relevant to the i request.

Define ( 1 if the j document relevant to the i is retrieved by strategy A 13 ( 0 otherwise I — l,..,k, j — 1,..,n., and ( 1 if the j document relevant to the i t is retrieved by strategy B
y

request

request

ij

( 0 otherwise i = 1,..,k, j = 1,..,n

Then assume the following model P(x. . = 1)

log ^ - i
log
p(

P(x. . = 0) ~J ij
= Q)

-—ID = -6 + A

JL,..,n., 1

_ , •• , K L

j = l,..,n±, i = l,..,k,

which assumes that the difference between A and B is constant (on the logistic scale) across requests. Note that the parameter 6 can be thought of as the 'strategy1 effect and A. . can be thought of as the 'document' effect.

So, it follows that (6 + X. .)x. . P(x i .,y. j ) =
e

( 6 + A. .)y. . 13 13
/n

6 + X. ,x

-6 + X. . ,
13)

(1 + e ij) (1 4 e (assuming x.., y.. are independent) So

B21

P(x,y 6,A..) =±2l JlJ P(x ij ,y. j / 6,A..)
e 6(x.. - y..) SEA..(x.. + y. .) e 13 13 13

^i /, S + X . .v , „ -6 + A . ., .n. .n. (1 + e 13) (1 + e 13) 1=1 3=1 where x.. = .E .x . ., y.. = . E .y. .. i,D ID 1/3 ID Therefore the distribution of (x.. - y..)/(x.. + y.., i = 1,..,k, 3 = 1,..,n.) 13 13 1 depends only on 6. So we should consider the distribution of x.. (x.. + y.., i = 1,..,k, j = 1,..,n.). So we seek the distribution of the 13 13 1 sum of independent variables.

If x. . + y. . = 0 the x. . = 0, and if x. . + y. . = 2 then x. . = 1, so it ID ID ID ID ID ID is sufficient to consider the distribution of i5j: X ij 7 ^ i j + Y iJ' 1 = 1 " - k ' j = 1 ' - ' n i ) x. ,+y. .=1 iD ID ' This is equivalent to considering the distribution of k Eb. / (b. + c., i = l,..,k) x X 1 x It is easily seen that 6 P(x. . = 1 / x.. + y.. = 1) = —r-^ -r 13 13 13 6 -6
J

*

(see (1)).

J

J

e

+

e

and so, on the null hypothesis that there is no difference in performance between the strategies ( 6 = 0 ) we obtain that b. / (b. + c.) % Bi(b. + c.,h)
1 1 1
1 1

Given (b, 4 c., i = l,..,k), b f ..b_ are independent; thus 1 1 I k k k Eb. / (b. + c , i = l,..,k) % Bi(E(b. + 1 1 1 1 ± 1 c.),h)
1

So, with this model, for a uniformly most powerful unbiassed size a test, we should reject 6 = 0 in favour of 6 > 0 (that is, A better than B)

if and only if k k Eb. > C(E(b. + c.) ,a)
x

1

x

1

1

k where C is found from Bi(E(b. + c.),^) tables (probably a normal approximation will do ) . Note that if ( 1 for b. > c.
A =
X X

\ 0 otherwise i = l,..,k k and we do a sign test, treating EA. as Bi(k,^) on the null hypothesis, we 1 X i

B22 presumably get a less powerful test than the one above, since the parak meters of the binomial distribution are respectively k and £(b, + c.) 1 k 1 1 and by definition k < E(b. + c . ) . So the sign test will be conservative.
1 1

k However, in practice the value of £(b. + c.) is not known before the 1 1 1 experiment. There are two main ways of trying to overcome this.

Firstly, choosing a small sample and thus obtaining a confidence k k interval for E(b. + c . ) . Alternatively, the expected value of E(b. + c.)
1

i

i

x

a.

i

can be used. Now with x. .,y. , and b. defined as before, i = l,..,k, j = l,..,n. ID ID i J it is easily seen that n.

b. = 1
so •

• T1 j=l D

E V . (1 - y. , )
!D

6 4 A. . E(b.) = •£ .^ Dl + e 13
e

2A. . £
D(1

. + e

M

,

e

1J

^—r
ID)

.

ID)

(1 + e

Similarly, c. =
1

n. Z 1 (l - x. .)y. . j»l ^ ^ 2X . . 1 3 6 + A. ,x /n -6 + A. . , (1 + e lj) (1 + e lj)
e

and so -6 + X . . E(c ) = I— ^ i \ -6 + X. , 1 + e IJ Hence 6 + X. . -6 + X., „ 2X. . E( b + c ) - E — 3J + «e ID 2e_ij i Ci .. 6 + A.. -6 + A.. 5 + A..w1 -6 + A... M Dl + e 13 1 + e ID (1 + e 13) (1 + e ID) and so E(E(b. + c.)) follows easily, and clearly depends on A,,, i = l,..,k,
i
^

S *

j = l,..,n.. These parameters are also unknown, and so this is as far as the analysis of this method reached. One remaining possibility was to assign A e a prior distribution to the A. . f s (for example r uniform on (0,1)), and J 1 + e then perform a Bayes preposterior analysis. However, the integrations involved would have to be done numerically and would be multidimensional. Also, because of the number of parameters involved, it is hard to see what is being said, implicitly, about the data when a particular prior distribution is assigned to the parameters; so it is doubtful whether or not the time spent performing the integrations would have been worthwhile. This method was therefore abandoned so that alternatives could be examined.

B23

B3.2 Wilcoxon's signed ranks test

The next method to be attempted was replacing the sign test of the design study report by Wilcoxon's matched-pairs signed-ranks test. So the situation is that the two strategies, A and B, are compared, as before, over the request set.

Now let d. denote the difference in recall (precision) between the strategies for the i request. It now has to be decided whether or not it

is meaningful to rank these differences: if so, let r. denote the rank of d.. Suppose there are k requests in total; then define k

T
w

= Er.s.
ii strategy A i s b e t t e r than strategy B for the i request

where s. = +1 if

and -1 otherwise. Assuming hypothesis H

to be t r u e , i t i s seen that s. = +1 = 0 and so

with p r o b a b i l i t y h and - 1 with probability h. Thus E(s.) 2 E(T ) = 0, and i t follows that var(T ) = E(T ) . w w w

Now E(T
2

) = E ( £ r . s . ) 2 = E{Er.2s.2 + 2 i i ii x

i3fij

E r.s.r.s.) i i 3 3

I t can e a s i l y be shown t h a t 2 E(s. ) = 1 and E ( s . s . ) i iD

=0

(i * j ) , i , j

= l,..,k.

Therefore
var(T
k 2 2 1 ) = E ( T ) = Z r . = 7 k ( k + 1 ) ( 2 k + 1) w w 1 6

Now denote P(A > B) (again assumed constant over the requests) by p. Then E(s.) = 2p - 1, i = 1,. . ,k and so E(T ) = Er. (2p - 1) = ^k(k 4 1) (2p - 1) w 1 1 2
k 2 2 ) = Er. E(s. ) + 2 E r.r.E(s.s.)

E(T E(s
1

w

. 1 1

1

...13 i j l*j

2

) = 1 as before. Assuming independence between requests it follows

that E(S.S.) = E(s.)E(s.) = (2p - l ) 2 ID I D

B24

Thus
E(T
2

w

) = ^k(k + 1) (2k + 1) + 2(2p - l ) 2 I r.r,
6 . . l j

2 Z r.r. = (Zr.)2 - Zr.2 = ^k2 (k + l ) 2 - 7k (k + 1) (2k + 1)
. 1 1 1*J 1 1 D

Therefore var(T ) = E(T 2 ) = |k(k + 1) (2k + l)p(l - p) w w 3 For k large enough (greater than 25) we can assume that T is normal-

ly distributed, and since (given p) its mean and variance are known, the exact form of the distribution is known. Table 6 is analogous to Table 1. Column four lists the smallest value c, say, such that
p(IT

I >

H

) < 0.05

w o and was obtained by using the fact that under H T ^ N(0,^k(k + 1)(2k + 1)) w 6 and by using tables of the normal distribution. This results in c = (^k(k + 1) (2k + 1))J5(1.96) 6 Column five lists the smallest p > 0.5 such that P( T w > c / p) > 0.95 (5% significance)

and this is calculated in the same manner as in paragraph B2.2.1. Similarly column six is arrived at by the same means as before. It should be noted that the number of documents of known relevance status required to be assessed is higher than when the sign test was used, and so the sign test is better. This is surprising since Wilcoxon's matched-pairs test would appear to be making fuller use of the data as it assigns more weight to a request which shows a large difference between the two conditions than to a request which shows a small difference (that is, it pays attention to the magnitude of the difference as well as the direction) The fact that the sign test is better in this situation can be proved algebraically as follows. The power of the sign test and of Wilcoxon's test are * P(z > /kp(l - p) and

B25

p ( z

>

c - ^ ( 2 p - l ) k ( k + 1) - ^ ( k + 1) (2k + l ) p ( l - p )

}

r e s p e c t i v e l y , where z ^ N ( 0 , 1 ) , c = AjLwk * c = (1.96)/k + k + 1 .
6

+

-,) / 2 v

+

D^1*96^

a n d

So the sign test is more powerful than the Wilcoxon matched-pairs signed-ranks test if c - M 2 p - l)k(k + 1)
/ >

c
A P ( 1

- kp "
P)

( | k ( k + l ) ( 2 k + 1)P(1 -p>

After some simple algebraic manipulation this reduces to (2k/2k + 1 - k / 6 ( k + l ) ) p > (k + l ) / 2 k + 1 - k / 3 ( k + lY 2 The table below lists the minimum value of P(A > B) (that is, p) for the sign test to be more powerful, given the number of requests, k.

300 400 500 600 700 800 900 1000

0.513 0.509 0.507 0.506 0.505 0.505 0.504 0.504

As, in general, the value of p required for the test to have 95% power is greater than 0.55, it follows that the sign test will be more powerful.

This result is probably due to the fact that the underlying distribution is binomial. Apart from the fact that more documents of known status are required to be assessed, other disadvantages of Wilcoxon1s matchedpairs signed ranks test are the cost involved in ranking the differences and also that the ranking may not be meaningful. So this method is not recommended.

B3.3 Likelihood ratio test

Next the likelihood ratio test was attempted. For this it was assumed that

B26

P

Y

i

=

P

A. " B. *
1 1

P

N(P

AqA

+ P

BqB

A"V

Z;
1

>
request, and

where

p i

- P i
H

are the observed recall values for the i

we test o '• P v ~ P„ = 0 , ^A *^B vs H_ : p^ * p . 1 A *B
JC

Thus we have a random sample (y ,..y ) where k is the number of requests.
1

The l i k e l i h o o d

function
e

P(y ; 6) = . S i - L V 27Tm

x p ( ^ ( y . - <PA - P B »

n.
I

where 0 = (P7v'Pn)' - i- the number of assessed documents relevant to the , A B l i request, and m = p q + p q (assuming independence of the requests

n

s

and that p , p

remain constant over the request set).

We consider the ratio

P(y / 6,H1)

and reject H

if it is too large. First P ( _ / 0,H ) has to be maximised with y

respect to p and p . Equivalently, maximise logP(y / 0,H ) . This results in A B 1 having to solve the following equation 2kp A 3 - (3k + N)p A 2 + p A (2kp B (l - p B ) - 2 E n i Y i - 2p B N - k) + P g d " P R ) + Ir^y.
+

P B N - kp B (l - p B ) and p interchanged. This

where N = En., and a similar equation with p method was not pursued.

B4

The Squares method

B4.1 Non-statistical summary

This section is devoted to an alternative method which could be developed in a satisfactory way. This method, which we have called the Squares method, is therefore the main competitor to the Pool method, and in chapter B5, following the technical presentation of the Squares method , the two are compared.

B27

The Squares method is based on a retrieval contingency table giving, for two (indexing, searching) strategies A and B, the number of documents retrieved by both A and B, the number retrieved by A but not by B, by B but not A, and by neither A nor B: i.e. the table cells represent A A B, A A -B, -A A B, and -A A -B. Such a table may in principle refer to any documents retrieved, or more specifically to relevant documents only, or to non-relevant documents. In the present context the first of these is irrelevant: our interest is in the ability of strategies to obtain relevant documents, or avoid non-relevant ones. Further, as the object is to compare A and B, the cells of interest are those where the two strategies differ, i.e. those where A retrieves but B does not, or vice versa. It is moreover evident, as the tables are confined to either relevant or non-relevant documents, that the method is suited to the use of recall as a performance measure, but not precision, for which tables would have to be combined, and fallout is therefore substituted for precision as a performance measure. The discussion which follows is carried out in terms of recall, the treatment of fallout being easily seen by analogy.

Essentially the Pool method aims at determining, within the framework given by the use of the sign test, specified significance level, etc., whether the evaluation sample of documents of known relevance status justifies the assertion that strategies A and B differ in performance. The Squares method approaches the problem from a rather different angle by looking to see what differences in the specific contingency table cells are required to support the assertion; i.e. it is concerned to determine whether, for recall superiority in strategy A say, the number of relevant documents retrieved by A alone, in relation to the number of relevant documents retrieved by either A or B alone, is larger than would be expected if A and B performed the same. In other words, we evaluate A A -B/((A A -B) V (-A A B)). Alternatively if the number of documents retrieved by A is sufficiently smaller than would be expected, the null hypothesis that A and B do not differ is rejected in favour of B's superiority to A.

The method requires certain assumptions. In order to forecast the required sample size, i.e. total number of known relevant documents, a value has to be assigned (a) to the probability that a relevant document is retrieved by only one strategy and (b), more specifically (assuming that we are looking for A better than B ) , to the probability that a relevant

B28

document is retrieved by A, given that it is retrieved by just one strategy. We may deal with (b) by placing a lower bound on the probability, analogous to the assumption p - p > 5% for the Pool method. But for (a)

a specific value is desirable, so if this is not forthcoming from previous experience, for example, the only course is to set a lower bound here too and take this as the value the probability has.

The appropriate sample size is obtained as follows. If n. is the total of known relevant documents for request i, a. the relevant documents retrieved by A and B, b, those retrieved by A alone, c. those re1 l

trieved by B alone, and d. those retrieved by neither, we have a contingency table as follows:

retr retr not retr a.
I

not retr
b.
l

c.
I

d.
l

n.
i
*
••

...

i

Summing over k requests we define n = En., a = la. , b. . i i similarly, to obtain Table I: relevant documents retr not retr B retr
a c

and d.

not retr
b d n

L

We first assume that the distribution of (a,b,c,d) , given the value of n, is multinomial with probabilities (p ,p_ ,p ,p_) , defined as

a b e d Mn(n; p ,p, ,p , p n ) ; i . e . we assume t h a t the p r o b a b i l i t y of a r e l e v a n t docua b c a ment n o t being r e t r i e v e d by e i t h e r s t r a t e g y i s p . of being r e t r i e v e d by B
d alone is p , by A alone is p, , and by both is p . Thus Table I has a corresc b a ponding probability table:

B29

relevant documents
t

retr
! ' ' —

not retr
••'' ' ' " •

retr not retr

!

a
p

i
c

b
p

T

d

II

Given then some known relevant documents, the probability of obtaining some specific (observed) set of values for a,b,c,d in Table I, denoted by P(a,b,c,d/n), is ( n! a b e d . ( a li v ! c Md1!P aa P K P C P d i f a + b + c + d = n a b, n i D ,

P(a,b,c,d/n)

( 0 otherwise
The recall value for strategy A is determined by p mated by a + b and for strategy B it is p + p, (in practice estia b + p , so to compare strategies and p .

A and B for recall it is sufficient to compare p

Now if the value of b + c is known the distribution of b is P^ Bi (b + c,
PK b
+

) , that is has a binomial distribution. There are b + c P„ c

relevant documents which could land in the (1,2)th element of Table I, and
PK

for any such document the probability that it does land there is b c or A say. So, given the value of b + c, and using the normal approximation to the binomial distribution (as for the evaluation of the critical region for the Pool method), we can find K(a,b + c) such that the hypothesis P, = p is rejected in favour of p, > p just when b > K(a,b + c) . K is b e b e a constant depending on the significance level a of the test used (one assuming a multinomial distribution for the table), and on the value of b + c. If p. = p then A = h, so b has the distribution Bi(b + c ,h). b c if a = 0.05, say, K must be such that P ( b > K / P = P ) < a . Thus

As for the Pool method, we require stable power of the test as well as a specified significance level; specifically, we want the power of the test to be at least 0.95. This imposes a lower bound on the value of A, which can be found by trial and error, like the lower bound imposed on P(A > B) for the Pool method. In practice this means that when the true value of A exceeds this lower bound we can be confident that we are correctly rejecting the hypothesis p = p , i.e. the power is greater than 0.95.

B30

Unfortunately, the value of b + c is not known (in advance, that is, of actual experiments with any pair of strategies A, B ) . However it is known that b + c has the distribution Bi(n,II ) , where I represents p I + p

(the probability a document is retrieved by one strategy only), so if some value of I is assumed, the exact form of its distribution will be known and I we can then find r and s, say, such that P ( r < b + c < s ) = 0 . 9 5 . We can also

find the expected value of s, say e. Then given any specific number of relevant documents, n in Table I, and assuming some specific value of I , subI
K.

stituting b + c by r, e and s respectively gives the value of K(a,b + c) , and the lower bound on the true value of A. If the nature of the strategies being compared is such that they may be expected to retrieve different relevant documents, the assumed value of I should be quite large, while if they are I R expected to be the same, the assumed value should be low. The Pool method specifies requirements for the number of documents to be assessed for each request, to provide enough relevance information for that request. The Squares method is concerned only with sets of documents of known relevance status for a set of requests. This is an advantage for the latter,as will be discussed more fully in chapter B5 f However it should be noted here that relevance assessments are necessarily assessments for individual requests and the only way of obtaining the total set n is by summing the assessments for the individual requests. Thus the Squares method necessarily relies on a pool and sample basis for assessment like that used for the Pool method. In other words the argument assumes that n is obtained as the sum of n.'s, where each n. is a random sample of a comprehensive pool. However, as will be emphasised in chapter B5, as there is no requirement that the individual n. f s satisfy specific requirements, and should not vary, the size of sample and of request set can be chosen pragmatically, and the latter in particular can be adjusted in relation to the observed rather than expected properties of the n.'s and accumulating n.

To illustrate the kind of figure obtained from the argument (see Table 7 ) , for 5% significance and 95% power in the test, and a conservative I = 0.25, i.e. a value of I assuming a high overlap in output between straI I R R tegies A and B and hence a small performance difference between them: if n = 3000, b + c has a lower confidence bound of 704 and b must exceed 378 if A is to be accepted as better than B, where the minimum value of A

B31

required for 95% power must be > 0.568. On the other hand, if I = 0 . 5 0 , T R assuming less overlap, n = 2000, b + c has lower bound 938, b must exceed 499, and A must > 0.559. If we are concerned with recall, the more awkward case, we could perhaps expect to get a total of 2000 relevant documents from 80 requests averaging 25 relevant documents each, and 3000 from 120 requests. We need a large n as this is associated with a low value of A: a low A represents a realistic expectation about the number of documents in cell b in the contingency table in relation to those in b + c, i.e. about the number of documents to be retrieved by strategy A alone, as opposed to by either A or B alone. Thus if n = 50, the expected value of b + c = 12.5, b is at least 11 and A = 0.935; and the chance in practice of finding such a relatively large b is very poor. The assumptions underlying the Squares method can be summarised, for comparison with those used for the Pool method, as follows 1 for future experiments comparing strategies A and B 1 2 3 we evaluate using recall and fallout; recall and fallout are probabilities estimated by proportions based on samples; the distribution of the retrieved document sets a,b,c,d in the contingency table comparing A and B, conditional on the total n, is Mn(n; p ,Pb,P ,p^); the distribution of set b (retrieved by A but not B ) , conditional on the set b + c retrieved by A alone or B alone, is Pb Bi (b + c, •— (alias Bi (b + c,A)); P b + Pc the distribution of b + c is Bi(n,p + p ) (alias Bi(n,n )); a normal approximation to the binomial distribution for the power of the significance test.

4

5 6

2 for assessment data the same four assumptions as the Pool method. 3 for request data in evaluation and assessment 1 the requests are independent.

The two methods thus chare assumptions 2.1 - 2.4, and 3.1. But as the technical argument of paragraph B2.2.2 above showed, assumptions 2.1 and 2.2 are not necessary and can be abandoned in favour of simply knowing the pool size (though the argument was presented under the Pool method it is equally applicable to the Squares method). Assumptions 1.1 and 1.2 have the same general character for the two methods, with the specific difference that fallout replaces precision. Assumption 1.6 for the Squares method is

B32

also like 1.6 for the Pool method. On the other hand an important feature of the Squares method is that the strong assumption 3.2 required for the Pool method, namely that the probability of finding strategy A better than strategy B is constant across requests, is not required. The distinctive asssumptions for the Squares method, 1.3, 1.4 and 1.5, replacing 1.3-5

for the Pool method, follow from the structure imposed on the data, and lead to a rather different argument. However within the framework of this argument these assumptions are somewhat analogous to those used for the Pool method.

B4.2 Statistical presentation

We wish to compare the two strategies A and B. Let n. denote the total number of assessed documents relevant to the i request, a. denote

the number of these retrieved by both strategies, b. the number retrieved by strategy A but not strategy B, c. denote the number retrieved by strategy B but not by strategy A, and d. denote the number retrieved by neither strategy. This information can be summarised in the following table: relevant documents retr not retr B retr not retr b.
l

r
a.
I

I

d.
l

n.
-

I

|
i

The Squares method presented below can only be used if retrieval data can be set out in this form.

We now define, for k requests n = k £ n. , a =
. i=l. i

k £ a. , b =
- i=l i
1

k k £ b. , c = £ c. and d
- i i=l
1

T. d.

- i=l

i

1

i=l

X

and obtain Table I: relevant documents retr not retr B retr a c not retr b d n
L __ 1

B33

Clearlyf to compare the two strategies for the request set, a and d are not as important as b and c. This method bases its decision on the size of b relative to b + c: if b is relatively large (small) then the hypothesis of equality can be rejected in favour of strategy A being better than strategy B (vice versa).

Table I has an analogous table of probabilities, namely relevant documents retr not retr B retr
P

not retr |
P

a c

b II d

P

P

and tables analogous to Tables I and II exist for non-relevant documents, namely non-relevant documents retr not retr B retr a' not retr b' d' n'

c'

non-relevant documents retr not retr

B retr p^
P

not retr

p

b- 1

c-

*d. 1
a
JD
C Q

Let N denote the total number of documents assessed, and suppose that the distribution of (a,b,c,d/n) in Mn(n; p ,PK,P , p j and that the distrib u t i o n of (a1,b',c',d'/N P(a,b,c,d/n) - n) i s Mn(N - n; p , , p ,,p ,,p , ) . That is,

n! a b c d . _ ( —: —p p n p pn i f a £ a ! b ! c d r a *b ^ c ^d ( 0 otherwise

+ b + c + d = n

Similarly P(a',b',c',d'/N - n ) .

Now the recall value of strategy A is p
a

+ p
J3

and for strategy B it

is p + p . So strategy A is better for recall if p, > p . Similarly straa c b e tegy A is better for fallout if p < p ,. (Note that strategy A is better

B34

for precision if P + P, P + P

P a + *b + *a' + ^b 1 Pi_ P i Pn ,

P P P . P 1 *a + *c + *a' + ^c ,

however, as using precision for evaluation requires a combination of relevant and non-relevant document tables, and generally leads to complexity, it is much more convenient to use fallout to complement recall, as fallout only requires the one non-relevant document table.

p

The conditional test of p, = p against p. > p b c b e = p if and only if b > K(b + c,a) . = p

has the rule, reject
J

(1)

That is, reject p

if b is larger than a constant K, which depends on

b + c and a (the significance level). Also the distribution of b / (b + c)
P

is Bi(b + c , b the test is

b c

) (providing b + c * 0) , so the conditional power of

P

P(b > K(b + c,a) / b + c, b and the unconditional power of (1) is

b -=—) c

(2)

Ph -=—) + p b c It is known that b + c °o Bi(n,p, + p ) and so b c E. (P[b > K(b + c,a)] / b + c, b+c p

P(b + c = k ) = 0 ( P b
Therefore the power is

+

Pc)kd-Pb-Pc)n_k

•

n £ P(b > K(b + c,a) / b + c,A)P(b + c) b+c=l
P

where A = Pi_ ^b

+

b P ^c

So

in order to find the power of the test P(b + c = k) must be evalu-

ated for k = l,..,n and K(b + c,a) must be found, that is the smallest K such that P(b > K / b + c,A = h) < 0.05 where b ^ Bi(b + c , ^ ) .

Since n will usually be very large the normal approximation to the binomial distribution can be used to calculate P(b + c ) ; also for b + c > 10

B35

the normal approximation can be used to find K.

Finally P(b > K(b + c,a) / A) has to be calculated for b + c = l,..,n. Again the normal approximation can be used for b + c > 10.

Since n is likely to be very large these calculations should clearly be performed on a computer. However precisely because n is very large and P(b + c ) , K(b + c,a) and P(b > K(b + c,a) / A) must all be calculated for b + c = l,..,n, it would take an extremely large amount of computer time to find the power for any particular values of n and A.

As part of the object of this report is to construct fairly comprehensive tables, which in this case means varying the values of n and A, it was decided that it would not be feasible to calculate the exact power on a computer.

A rather crude method of overcoming this problem was to calculate a 95% confidence interval for b + c and then to replace b + c in (2) by its upper and lower bounds in the confidence interval, and by its expected value. This results in three different values for the conditional power of the test,which should give us a reasonable indication of its unconditional power.

Now b + c ^ Bi(n,II ) where n
R r v .

= p
D

+ p , so E(b + c) = n l . Also we i
C t\

wish to find r and s such that P(r : b + c < s) ( = s Z (n)TI 3 VJ - I ) (1 I l j'"R " "R -,=r
SD

) = 0.95

That is, to find the largest r and smallest s such that P(b + c < r) < 0.025 and P(b + c > s) < 0.025.

Again we use the normal approximation, which leads us to choose the largest r and smallest s such that r - nn R < -1.96 and s - nn R ;— > 1.96.

/nn (1

R

"^

^V 1 " V

B36

So a value of I (the probability in relation to recall that a relevant I R document is only retrieved by one strategy) must be assumed. Once a value for b + c is found (viz r, s or E(b + c)), the fact that b ^ Bi(b + c,^) (the distribution is conditional on b + c)

if the strategies have no difference in performance, is used to evaluate K such that .H is rejected if b > K. o It is now necessary to ensure that the test has 95% power, and this forces a lower bound on A. This can be found by a trial and error method similar to the one used to find the lower bound of p in the Pool method.

Table 7 lists the value of K for various values of n and I , and I
R

Pb also gives the minimum value of for which one can be confident ^b *c (in the sense of achieving 95% power) of correctly rejecting H . For each

value of n and I the values of b + c tabulated are on the lower bound of I R the 95% confidence interval, its expected value, and the upper bound of

the confidence interval respectively.
p

b

increases as n and/or b c I increases. Also as n increases there is little difference between the I
P

Predictably enough the minimum value of

three values of the lower bound for fixed values of I . I

Before using Table 7 an estimate must be made of I . If the two straI
R

tegies are very different then I may be thought to be very high (> 0.75, I R say) , while if the two strategies are very similar than I could be about I R 0.25 (or less) ( I is expected to be small if b + c is expected to be I R small relative to n, and is expected to be large if b + c is expected to be R relatively large, since, if the experiment had been performed then I I would be estimated by )

Suppose that two strategies A and B are being compared for recall and that it is expected that approximately half the relevant documents will be retrieved by only one strategy (that is I ^ 0.50) (information concerning I
R

I may be available from earlier studies) . If one has a total of 2000 releI
R

vant documents then reject H
1

(at the 5% significance level) if b > 563

(assuming the 'worst case for b + c ) . One then has 95% power of correctly

B37

detecting that A is better than B if
P

> 0.559. b
+ P

c

Note that since b + c is expected to lie between 938 and 1062, one has to take the worst case both for the value of K and for the lower bound on A.

Conversely, suppose that one wishes to be confident (in the sense of 95% power and 5% significance) of deciding that A is better than B, whenever A > 0.55; then approximately 3000 documents (relevant to any of the requests) are required to be assessed (assuming I = 0.50) . Note that there I R may well be overlap in these documents,but providing the requests are assumed independent this does not matter. Further, this overlap in documents does not affect the number of assessments made. The hypergeometric program of Appendix 2 assumes that all the documents are distinct, but it is equally applicable when they are not, since the assessments are all distinct. Finally note that A is better than B for fallout if p < p ,. So we

can use the same argument as for recall, only this time n is the total number of non-relevant documents.

Summary before This test is a conditional test and so,/being able to apply it a value of I and a lower bound on A must be assumed. Independence between I R requests is assumed but this time P (A > B) is not assumed to be constant across requests. The actual mechanics of choosing the sample size could prove a problem: this is discussed in more detail in the next chapter.

B5

Comparison between the Pool and Squares methods

This chapter compares the Pool and Squares methods, primarily from the point of view of their interpretation and implications.

To make actual numerical comparisons is difficult since the number of documents to be assessed if the Squares method is used depends on I . I R Comparisons when I is assumed are made later in the chapter. I

B38

One of the main differences between the methods is the amount of information which must be obtained. For the Pool method all that is needed is how many relevant documents the individual strategies retrieved. However for the Squares method a further classification is required, since as well as knowing the above, one must also know the number of documents retrieved by strategy A but not by strategy B, and vice versa. This is obviously a disadvantage since presumably the increase in information supplied can only be achieved by some increase in costs.

Another difference is that the Pool method depends much more heavily on the number of requests than the Squares method. In the former the number of documents needing to be assessed descreases as the number of requests increases. However in the latter the dependence is not so explicit since here the power of the test increases as b + c increases. Clearly b + c depends on the number of requests, but it also depends on how general the requests are, since the more general the requests are presumably the larger b + c will be (since there will be more relevant or retrieved documents). Therefore it is conceivable that the same value of b + c could occur for a set of 200 requests and a set of 250 requests. In this situation there is an increase in power in the Pool method but the power remains unchanged in the Squares method. So, for the Squares method, important than the number of requests. the value of b + c is more

Whichever method is used there is a problem connected with the sampling aspect.

In the Pool method we are interested in the number of documents of known relevance status each request has. If we have k requests the method supplies us with the number of documents (n.) which must be assessed, for the i request, in order to make the comparison 'meaningful' (as defined

earlier), for i = l,..,k. The problem is that the number of documents varies considerably from request to request (see chapter A 2 ) . However we can only make one assessment of the pool and this must result in a sufficient number of documents of known status for each request (some of which may not even have n. documents, especially relevant ones, in the pool).

It would be unduly pessimistic to choose the size of the sample to be assessed on the basis of the requests with the fewest relevant documents.

B39

A more practical method would be to take the average number of documents of known status per request, as if all requests have that number, and sample accordingly.

For example, to take the recall case, suppose there are 3 requests with 25, 50 and 75 relevant documents respectively, and the Pool method requires 20 relevant documents per request. Then if we want the probability of achieving this number to be at least 0.95 in each request then over 80% of the pool must be sampled because of the small number of documents relevant to the first request. However, this would presumably result in assessing over 60 documents relevant to the third request and so the actual power of the test is greater than 0.95, and a lot more resources than necessary have been used. So, as a compromise suppose that each request has 50 relevant documents and then assess a significantly smaller percentage of the pool (< 60%) .

A question which is raised in this connection is whether requests with a small number of relevant documents should be discarded, but this introduces some bias into the request set (see Section C for further discussion on this point).

The situation is much simpler for the Squares method since we are no longer interested in n. but in n (= En.), that is, in the total number of
1 1

relevant documents. This therefore removes all problems of variation between requests and makes the sampling problem much simpler. The only drawback here, as mentioned in the last chapter, is the fact that though the n documents are unlikely to be distinct they have to be treated as distinct as they have to be specifically assessed in relation to their various requests. In other words, n refers to the total of relevant document postings, and there can be no economising on assessment if the total of different documents retrieved is smaller. However, even if there is this slight defect, overall there is no doubt that the Squares method does not create as many sampling problems as the Pool method and, in this way, has a big advantage.

Both methods also require some assumptions to be made about the size of the difference in performance one wishes to be confident about detecting.

E40

In the Pool method this amounts to assuming that p

- p

> 0.05, and this

assumption was made when the sampling theorem was used (see chapter B2 (PA = P^ + P K * P D = P + P # in Table II. A a b B a c In the Squares method this assumption took the form of choosing a lower bound for b Pb • c (i.e. A ) . Thus if, for example, it is assumed that

A > 0.55, then in order to achieve 95% power and 5% significance one needs (to the nearest 100) a total of 5200 documents to be assessed, and specifically 5200 relevant documents to be identified for recall evaluation.

The assumption made by the Pool method has the advantage of being easily interpreted since p
p P

- p

is the difference in recall while

b
P

is the probability of a document of known relevance status being c

b

+

retrieved by strategy A given that it is only retrieved by one strategy.

Suppose now that strategies A and B are being compared for recall, and that there are 300 requests, 1000 documents in the pool, and an average of 25 relevant documents per request; and that we require 5% significance and

95% power. Then there are 7500 relevant documents altogether.

Then, using the Pool method and applying Table 1 in conjunction with Table 3 we obtain that 729 documents should be assessed. Before using the Squares method, a value of p, + p (i.e. I ) must be assumed (given that we I b c R are interested in A > 0.55, say). If I = 0.25 then we require 5200 relevant I R documents to be assessed. The hypergeometric distribution tells us that, in order to have, with 95% probability, assessed 5200 relevant documents, more than 729 documents must be assessed. However if I = 0.50 then we only need I to have assessed 3000 relevant documents, and so the number of assessments required is certainly less than 729. Clearly things become even better if n R = 0.75. So whether or not the number of assessments required is less for the Squares method than for the Pool method depends on the value of I . If I I I R R is quite low then the Pool method will probably be better, while if I is I R large the Squares method becomes more efficient. That is, if the total number of documents retrieved or rejected by both strategies, i.e. the overlap

B41

in their outputs, is likely to be relatively large then the Pool method will require less assessments than the Squares method.

So the Squares method has the advantages of a simpler sampling problem and of requiring less assessments than the Pool method when the strategies are not too similar, as in general could not be expected, while the Pool method does not require as much information from the data as the Squares method, has a more meaningful constraint, and is the better method when the strategies are very similar. On the other hand the Squares method does not have to assume the constancy of P(A > B) across requests, which the Pool method does.

B6

Multi-strategy comparisons

Up till now we have only considered considered comparisons between pairs of strategies. What happens if one wishes to compare t strategies, A , A , . . ,A taken together say, which is a reasonable requirement for

retrieval experiments?

There are three obvious ways of approaching this problem. The first two, Cochran's Q-test and Friedman's test, are standard non-parametric tests, while the third is based on David (1963) . The first two are quite independent of the previous methods; David's method is somewhat similar to the use of the sign test.

Cochran's O-test Cochran (1950) has shown that if there is no difference in, say, recall under each strategy, then if k (the number of requests) is not too small
k -2 k(k - 1) .Z. (G. - G)

Q =

3rl_3
I
1

*

N N kEL. - EL.
I
1

is distributed approximately as chi-square with k - 1 degrees of freedom where G. denotes the number of relevant documents retrieved by strategy A. D ^ th " and L. denotes the total number of strategies which retrieve the I relevant
l

document.

So, the null hypothesis that the strategies have the same level of

B42

performance is rejected at the 5% significance level if Q is in the upper or lower 2.5% tail of the chi-square distribution with k - 1 degrees of freedom.

Note however that the power of Cochranfs test is not known exactly, and so no forecasts can be made about the size of the sample taken for assessment. Another disadvantage is the fact that rejecting H does not reo suit in an ordering of the strategies. Also, Cochran's test would be rather expensive to put into practice since one would have to note, for each relevant document, howe many strategies retrieved it. Friedman's two-way analysis of variance First of all the data must be expressed in a two-way table having k requests and t strategies. If there is no difference between the performance

of the strategies then, providing the number of requests and/or strategies is 2 not too small (k > 9, t > 4) it can be shown (Friedman (1937)) that x is distributed approximately as chi-square with k - 1 degrees of freedom when

*'=
3

kt"

+

D.VRj)2-31t<t +

1)

where R. denotes the sum of the ranks in the j

column.

Again the power of the test is unknown. Siegel (1956) points out that the test compares favourably with the F-test and so should be used in preference to Cochran's Q-test.

Method derived from David (1963): pair subset comparison Cochran's and Friedman's methods are unsatisfactory in that they do not supply enough quantitative information about the power of the test, and hence what their future use would imply for the asessment data cannot be determined., The method for pair subset comparison described below can be applied to future strategy comparisons, if multiple comparisons are wanted, and does not impose any requirements on the assessment data. It is simply applied to judgements of the relative merits of pairs of strategies.

This method was first considered as a possible alternative to the Pool method (in the case where t = 2 ) . It considers the case when there are t objects to be compared in pairs by k different judges. So this can be applied to

B43

our problem with the strategies replacing "objects" and requests replacing "judges". Each judgement is assumed to consist of saying merely which object is best and ties are not permitted.

Suppose t = 2, and denote the number of preferences scored by A. (i = 1,2) by a.. Since under the hypothesis H of equality strategy A. has probability

h. of being preferred in each of its k comparisons with the other strategy, the score a. is a binomial Bi(n,^) variate. So the method is equivalent to the use of the sign test in the Pool method. For the case when t > 2 this method would appear to be a viable alternative to Cochran's and Friedman's tests. For more details see David (1963).

If it is decided to impose an ordering on the strategies by the application of David's method, one must beware of circular triads occurring. That is, A > A , A > A ,A > A . These could result from the fact that there

may be no valid ordering of the three strategies since their performance may depend on more than one characteristic. Also if there is not a significant difference between the strategies then the comparisons (if no ties are allowed) cannot reasonably be expected to be consistent. If the sign test (Pool method) is being used then one can decrease the chances of this by combining the multinomial model and the sign test as follows.

Consider the table for the r Relevant documents retrieved strategy A not retrieved

request: strategy B

retrieved

not retrieved !
n

1

n
n

ll

r

1

i12 2
22

r

1
r J

2ir

n

1

nr
i ft

and consider the test V
P

12r

= P

2ir

VS P

12r *P21r'

Under the null Hypothesis we have that
n 8

i2r

+ n

21r • 21 '
S r=

n

!2r

+

n

2Ir th cell.

li

r

=

n

H

r

'

6

12

r =

and e

r = n r , w h e r e e..r are the expected values in the (i,j)

Under the null hypothesis we have 2 degrees of freedom, and 3 degrees of freedom under the alternative.

B44

Therefore, using Wilks1 likelihood ratio test we obtain that, if H is true, then (n. .r - e..r)

Z -3J
. . i/D That is
(n n

U
e. .r ID

* x2 A
i

!2r - n 2 i r ) 2 l2r
+ n

x

2 ^

2ir

So now if P(A > B) * P(n

r > n (n

r) but

P(A > B) = P(

r - n r) — — n r + n r 12 21

> 3.84)

(rejecting equality at the 5% level).

n r 11 Now r e c a l l of A = 6 A P(A =
nx2(PA-PB)2

+ n r 12 and r e c a l l of B = nr

n r 11

+ nr

n r 21 . So > 3.84)

B) = P(— 2 — > 3.84) > P ( n r ( p n - PJ n12r + n21r A B

Since we are i n t e r e s t e d in the case A > B, t h i s i s e q u i v a l e n t to

Putting z = ^

> • /p H q + p q *A A ^B^B nr A - p B

^ N(0,1)

and assuming p

= 0 . 0 5 , one obtains that the probability is equal to

P(z > (7.68)^ - (0.05)/2in?')

So, although this argument is a better one for deciding if A is better than B for a particular request, and reduces to the chances of circular triads by allowing ties, the cost of this improvement is that the number

of relevant documents required to be assessed increases. Also this method runs into problems if an attempt is made to apply it to precision, and so fallout should be used.

A final warning is that since the significance level is 5% (or 1%), if there are a lot of pairs to be compared the wrong conclusions can be drawn occasionally. The method of overcoming this is known as multiple

comparison (see Duncan (1955)).

B45

Overall/there is no very good way of providing for direct multistrategy comparisons. However David f s method can be used for indirect comparisons, via paired subsets, if more extended strategy evaluation is required. Since it is simply concerned with results of comparisons between pairs of strategies, these comparisons themselves may depend on either the Pool or Squares approaches.