CHAPTER 7 The CAMIR system CAMIR is a collection of programs for running information retrieval experiments. The files handled by CAMIR consist of binary records in VBS format. Each record consists of N words (i.e. 4N bytes) where N is at least 2 and has no effective upper bound. The first word contains N, the length of the record. The second word contains a numeric identifier for the record in the file. This will be called the id-word. Words 3 to N contain data for the particular record. All, or almost all, data handled by CAMIR goes into this form, even when it not quite convenient. In fact this form of data is quite suitable for almost all of the experimental work that we have been doing. Representation of documents A document is represented by a record in the form / d t j w j t2 v*2 ... t n w n / - d is the record id-word, and is simply the document number (1, 2, 3 ... ) . t-j, t2» ... t n are the term numbers for the document, sorted into increasing order, and w i is a weight for the term t i # For many experimental purposes the w^ will be irrelevant, and can be set to zero. Occasionally, however, it will have significance. Thus w^ could be the beta weight of Harter's model, which measures the usefulness of term t^ as a discriminator of document d. This form of document representation is not very compact, but can be processed rapidly. The w^ will almost always be integers, and frequently w^ will be the largest integer contained in 1000f^, where f^ is a computed floating point number giving the weight of t^ in d. If t n is the last term, the record length will be 2n+2. Representation of the other common files The file of documents gives rise to an inverted documents file which is represented in an exactly analogous way. / t d j w j d 2 w 2 ... d n w n / - The terms t (1, 2, 3 ... ) occur in the given documents with the given -107- weights. A common use of v^ is as a measure of the frequency of occurrence of t in d^. The file of queries has records of the form / q t 1 w 1 t 2 v*2 ... t n w n / q is the query number, and takes the values 1, 2, 3 ... through the file. Here it is very likely that significance will be attached to the Wj^. The file of relevance assessments, or rels, has the form / q d1 w j d 2 w 2 ... d n w n / The d^ are the relevant documents for the query q. The w^ are effectively dummy here, and may have the value zero. The representation of other files will be described as they occur. The records in the files contain no indication of their type, which is no doubt a mistake. Care must therefore be taken not to confuse, say, a file of queries with a file of rels, since the system will not trap such an error. Errors of this kind can lead to program collapses. The traditional form of IR data at Cambridge is as a series of numbers in text form. A group of numbers is terminated by a solidus, and a file of numbers by a further solidus. (This is described in [1].) To go from this form to the CAMIR form, use the command C MFP1.IR:BIN FROM TO and to go the other way, use C MFP1.IR:TXT FROM TO traditional form> The defaults for FROM are given in terms of 370/165 at Cambridge. the description which and TO are %C and %0 respectively. (These commands phoenix, the command language in use on the IBM There is a certain limited dependence on phoenix in follows.) The CAMIR command language CAMIR expresses its commands in a language which follows closely the form of the TRIPOS command language [2]. A command is followed by a list of keywords and arguments up to a semicolon. A keyword may have an associated list of qualifiers. The possible qualifiers are: A - The argument must be present. K - If the argument is present, the corresponding keyword must be -108- present. S - The argument is absent (i.e. the keyword simply supplies a switch). I - The argument must be an integer. An integer is a digit string, optionally preceded by a minus sign. Thus, qualifiers AK mean that argument and keyword must always be present. K without A means that the argument together with the keyword may be present, but not the argument without the keyword. If K is absent the keyword may be omitted, in which case the arguments so employed in a command are interpreted positionally: A without K means that the argument must be present but the keyword is optional, and the absence of both A and K means that the argument is optional, and when it is used its keyword is also optional. In describing a command, the qualifiers are written after their keyword and separated from it by a solidus. Thus the syntax of the EXPAND command may be described as follows: EXPAND FROM/ TO/ WITH/AK; Here the WITH keyword, together with its argument, must always be present. The FROM keyword is optional, and its argument is also optional. TO is like FROM. The following are therefore equivalent EXPAND EXPAND EXPAND EXPAND FROM A TO B WITH C; A TO B WITH C; FROM A B WITH C; A B WITH C; The arguments can also be reordered, so the above set is also equivalent to EXPAND WITH C FROM A TO B; EXPAND TO B A WITH C; The arguments may also have default values. For EXPAND, C is the default value of the arguments of keywords FROM and TO. This may be expressed by writing EXPAND FROM/(C) TO/(C) WITH/AK; And so the following are all equivalent EXPAND EXPAND EXPAND EXPAND FROM C TO C WITH A; C C WITH A; TO C WITH A; WITH A; A few commands (e.g. DEL) do not quite have this shape. These exceptions will be explained as they occur. Note that an argument may be any sequence of characters not containing a space or newline character. -109- Keyword arguments as structure names or ddnames The argument of a keyword will frequently be a reference to a CAMIR file. If the argument is prefixed with the / character, it gives the ddname of the / file. For example EXPAND FROM //QS TO #QS2 WITH #MST; QS and MST are the ddnames of two files on secondary storage. QS2 is the ddname of a new output file on secondary storage. A distinction is made in CAMIR between large and small files. The document and inverted document files are classed as large files. When reference is made to a large file by its ddname, the character / may be omitted. Small files constitute almost / everything else, and after reading a small file by using the argument //QS, say, it will be in core as a structure with name QS. Output may also go either to an external file, using the / character, or to an internal file / with the / character omitted. The command COPY, which has the syntax: / COPY FROM/(C) TO/(C); can be used for transferring structures between internal and secondary storage, and for making additional copies of structures. Thus COPY FROM //QS TO A; reads the file from ddname QS and makes a copy of it as structure A. At the end of the operation, the file is also left in core as structure QS, so that in fact two copies of the file are created as structures. Equally, COPY FROM QS TO //OUT; outputs the structure QS as a file with ddname OUT. The structure QS is not itself changed. Unwanted structures may be deleted by the DEL command, which is followed by a structure list. e.g. DEL C QS QS2 MST ... ; The current structure Many of the keyword arguments default to the value C. C can be thought of as a current structure, which will be utilised unless some other structure is explicitly named. -110- BMATCH (Best match) Syntax: BMATCH FROM/(C) DOCS/K(C) TO/(C) CUTOFF/KI(10) DWEIGHTS/S QWEIGHTS/S; The FROM structure is a file of queries (here and elsewhere, 'queries1 refers to any record in the general form / q t j w-| t 2 w 2 ... /) and DOCS supplies the source of the document file. Each query and document gives rise to a match value. This will be one of the following: a) The number of common terms between the query and document (if DWEIGHTS and QWEIGHTS are both uset). The sum of the query weights for the common terms (if QWEIGHTS only is set) . The sum of the document weights for the common terms (if DWEIGHTS only is set). The sum of the product of document weight and query weight for the commom terms (if DWEIGHT and QWEIGHT are both set). b) c) d) CUTOFF gives the maximum number of documents to be included per query in the resulting TO structure. The TO structure itself consists of records of the form / Q d | m^ ^2 m 2 ••• ° k m k ^ * This gives the set of documents d j ... d k which give the highest match values for the query q. They are arranged in decreasing m i order, or, when the rr^ values are the same, in increasing d i order. BWEIGHTD (Beta weights for documents) Syntax: BWEIGHTD FROM/A TO/A WITH/AK; The WITH file contains the 1, m, h values for each term in the collection, derived from Harterfs analysis of term frequencies. WITH will have been produced by the HARTER command (q.v.). FROM is the document file, and TO will be the same document file with Harter's beta weights added in. If term t has parameters 1, m, h in the Harter model, and its frequency in a particular document is k, its beta weight will be -111- s<| + sj + s2 where 1 - m (i.e. P(d in I|k) + z) sqrtd + m) 1 s^ = h e s 2 = (1-h) e~ m m k See p 284 of [33. The beta weights are expressed in fixed point form by multiplying by 1000 and taking the integer part. CHOOSE Syntax: CHOOSE FR0M/(C) T0/(C) WITH/K(RELS) COMPLEMENT/S NULLTONULL; This is typically useful when the FROM structure is a set of document-match values, derived from BMATCH, and the WITH structure is a set of relevance assessments. The TO structure can then be a reduced version of the FROM structure, containing only relevant documents. More generally, FROM is a structure whose nth record is / q d-| m-. d 2 m 2 ... d^ m^ / and RELS is a structure whose nth record is / q D^ x j D 2 x 2 .. . D ^ x j / < - -^ The D^ must be in increasing order, the x^ are dummy. With COMPLEMENT and NULLTONULL both unset, the nth record of TO will be / q d | T m-j ! ... / where every d^ m^ for which d^ is not also a D^ has been removed. If COMPLEMENT is set, every d, m, for which d, is also a D, is 1 1 1 J removed. If NULLTONULL is set, the nth record of TO will be / q / whenever the nth record of RELS is / q /. This is for certain evaluation purposes where one wishes to kill off queries which have no relevant documents. -112- COPY Synta x: Q OPY FROM/CO TO/(C); This has already been descibed. A copy is made of the FROM structure and it is written to the TO structure. DEBUG Syntax: DEBUG; This causes a global debugging flag to be switched on. Subsequent commands may test for this flag and produce monitoring information if it is set. DEL (Deletion of structures) Syntax: DEL ; The given structures are deleted from core, thereby making more space available to the system. EVAL (produces a P-R graph from the lin-file) Syntax: EVAL FROM/(C) TO/(SYSOUT); FROM is a structure called a lin-file, which will have been produced by MAKELIN (q.v.). TO gives the ddname of the text file to which the precisionrecall graph derived from the lin-file is directed. By default it goes to SYSOUT, the output log stream for CAMIR. The precision-recall values are worked out by producing precision values for each query with a non-null list of rels at standard recall positions (0.1, 0.2, ... 1.0) and then averaging. This is fully described on pages 194-195 of [4] and in chapter 7 of L51. The "graph" in fact simply consists of a listing of the pairs of R-P values. -113- EX (examine structures) Syntax: EX; The structures currently in the system are printed out by name, together with the number of words of core that they consume. EXPAND (query expansion via an MST of terms) Syntax: EXPAND FROM/(C) TO/(C) WITH/AK A/IK B/IK C/IK PRINT/S; FROM is a structure representing a set of queries. The resulting TO structure is the same set of queries, but with additional terms given by the term classification of WITH. WITH consists of a single record, as follows / 1 t* u* m* tp Up nip * • • tk uk mk / The record id-word, 1, is of no significance here, but it is important not to omit it. The record provides a term classification. Term t* is linked to term u^, and m^ gives the strength of the link. Frequently this record will correspond to an MST of terms, in which case two adjacent nodes in the tree, t and u, which are linked together with strength m, will give rise to two triplets in the record, namely turn u t m The initial order of the triplets in the record is irrelevant - they are reordered after input by increasing t, and when the t is constant by decreasing m, and when the m is constant by increasing u. m must be in integer form. A, B and C are used to control the expansion process. Each query term t will have a class of terms [u] associated with it, where t u occurs as the first two terms of a trio in the WITH structure. In other words [u] is the class of terms to which t is linked. The first C of these terms with the highest m values are taken, or the whole set [u] if it contains fewer than C terms. -114- This is done for each term t in the query, and the result is a class of terms obtained by collecting all of these u-terms together. This class, U, contains each u-term and its corresponding m value. If two members have the same u, e.g. u m-| and u rr^, the pair with the smaller m is removed. Then the class is arranged by decreasing m, and all but the first AQ+B terms are discarded, where Q is the size of the original query. Thus there are two controls in operation. When expanding t, only the strongest C links are taken. Then the query is expanded only by the linear factor AQ+B of the original size. C is given as an integer, A and B as percentages (i.e. multiplied by 100) e.g. ... A 50 B 125 C 2; By default A, B and C are infinite. If A is set, B defaults to zero, and if B is set A defaults to zero. GETRECS (forming a subfile) Syntax: GETRECS FR0M/K(D0CS) TO/ WITH/AK; The WITH structure contains a single record of the form / 1 id1 id 2 ... id k / The id-entries give a list of id-words which identify a set of records in the FROM file. The resulting TO file contains just this set of records. The list id-, ... id^ must have the same order as their corresponding records in the FROM file. The FROM and TO files are large files - i.e. not representable by in-core structures. GETTERMS (term statistics from a feedback set) Syntax: GETTERMS FR0M/(C) TO/(C) D0CS/K(D0CS); The FROM structure is a file of documents (typically known relevant documents) in which the id-word may be interpreted as a query number. Thus the nth record of FROM has the form / q d j w<| d 2 w 2 ... d R w R / > The W^ a r e dummy. DOCS s u p p l i e s the source of the document f i l e . The j output s t r u c t u r e TO c o n t a i n s information about the terms in the documents d i f and w i l l be c a l l e d the t - i n f o f i l e . From a t - i n f o f i l e feedback -115- weights may be computed (see NEWQS). The nth record of TO has the form / q R t, r, t 2 r 2 ... t. r, / t-| ... t m are the terms in the documents d j ... d^ and r^ is the number of these documents in which t^ appears. The t^ are in increasing order. HARTER (computes 1, m, h of the Harter term model) Syntax: HARTER FROM/A TO/A; The FROM file is the inverted documents file, in which the weight w ^ . attached to each term t^ is the term's within-document frequency. In the Harter model, the probability that a term occurs k times in a document is given by e" 1 l k P(k) = h k! + (1-h) k! e" m m k The parameters 1, m and h are computed for each term using the data giving its distribution in the inverted document file, and with Harter!s estimation method (see JASIS July-August 1975 p 202). The output in the TO structure consists of the four records / 1 h^ h 2 ... hj / / <-. 1 1 lp ... ±rn / / 3 mi mp ... m«p / / 4 n j n 2 ... n T / h^ 1^, m t are the values of pi , 1 and m for term t. The total number of terms in the collection is T. n t is the number of documents in which term t occurs, a statistic which is useful in later weight estimation. The values in the first three records are given as 32-bit floating point numbers. So the command WRITE, which expects fixed point integers everywhere, cannot be used with this particular structure. Instead HSTATS is used. If DEBUG has been obeyed, the various stages of the evaluation process are printed out. -116- HSTATS (prints out the Harter parameters) Syntax: HSTATS FROM/ TO/CSYSOUT); This prints out on dd TO the structure FROM, which is the output from HARTER. The result is in five columns: i, 1, m, h, z, i is the term number. z is (l-m)/sqrt(1+m). Each of 1, m, h and z are printed out as the integer part of the original value multiplied by 100,000. HWEIGHT (term weighting based on the Harter model) Syntax: HWEIGHT FR0M/(C) T0/(C) WITH/AK M0DE/AIK; FROM is a set of input queries, and TO a set of output queries, identical except for the new weights. The weights are expressed as integers after some suitable scaling. The WITH structure is the output from HARTER (q.v.). MODE supplies an integer which determines the weighting scheme. The weight is a function of 1, m, h and n for each term. The mode-rfunction correspondence has been altered continuously to run various experiments. At present we have the following: mode 1 2 5 6 7 8 9 10 etc. function where D is the document count log(D/n) 1+log(D/n) h=0 -> 0, m=0 -> 10, log(l/m) 1/sqrt(n) (1+log(D/n))2 (l-m)/sqrt(1+m) = z 1 z(1+log(D/n)) where z is as in 8. MAKELIN (making the lin-file) Syntax: MAKELIN FR0M/(C) MATCH/K(MATCH) D0CS/K(D0CS) T0/(C) DWEIGHTS/S QWEIGHTS/S EXCLUDE/K; MAKELIN is used in conjunction with RMATCH. RMATCH produces as output a match file, which is used as the argument of MATCH in MAKELIN. To be meaningful, the queries in the FROM structure, the document file of DOCS, and the settings of DWEIGHTS and QWEIGHTS must be the same as in the corresponding RMATCH command. A record from the TO structure has the form -117- / q d , p-, d 2 p 2 ... d n p n / d j ... d n are the documents associated with query q. If the documentmatch pairs for every document in the collection are collected for query q, they may be ordered by decreasing match value, and, when the matches are the same, by increasing document number. Then p^ gives the position down the list of document d^ in this ordering. If EXCLUDE is set, it determines a file whose nth record has the form / q D j w j D 2 w 2 ... D k w k / - The w^ are dummy. D j to D^ specify a list of files which are to be regarded as absent from the system in the construction of the lin-record for q. This option is useful in residual evaluations of retrieval performance. NEWQS (feedback query weighting) Syntax: NEWQS FROM/(C) TO/(C) WITH/AK TTOT/AK WEIGHT/K; The structure FROM supplies a list of queries, and the resulting structure TO is identical except for the new weights (cf. HWEIGHT). The WITH structure is a t-info file produced by GETTERMS (q.v.). TTOT is a structure composed of one record: / 1 n j n 2 ... n p / and n^ is the number of documents in which term t occurs. (Cf the fourth record generated by HARTER, which could be used for TTOT here.) WEIGHT supplies the appropriate weighting scheme. The possible values are: WEIGHT MILLER4 The weight for t- is (r± log (ni - r± + 0.5)(R - r± + 0.5) where R, r- are from the t-info record, and D is the collection size. WEIGHT EIQ + 0.5XD - n± - R + r± + 0.5) -118- The weight is N0Q D Nn + log N D log N N01 D log 0. N N1Q D »1. N .0 log 0. N .0 »1. H # 1 .1 where N ^ = r^ N Q1 = R - r i N 1 0 = n± - r± N oo = D - n i 0 . = N 00 + N R + r i - and N 01 etc The weight s are scaled and put into integer form. PARS (sett ing up the collection par,ameters) Syntax: PARS DCOUNT/IK TCOUNT/IK; This sets up the two parameters D, the number of documents in the system, and T, the number of terms in the system. It should always be used prior to any sequence of CAMIR commands. DCOUNT can be readjusted for the benefit of weight calculation, e.g. PARS DCOUNT 2500; NEWQS ... ; PARS DCOUNT 11429; RMATCH Syntax: RMATCH FR0M/(C) D0CS/K(D0CS) RELS/K(RELS) T0/(C) DWEIGHTS/S QWEIGHTS/S; The FROM structure is a set of queries, the RELS a set of relevance assessments. DOCS gives the source of the document file, and DWEIGHTS, QWEIGHTS have the same interpretation as in BMATCH. The resulting TO structure is a copy of the RELS structure with records of the form / q d | m>| c^ m 2 ... d n m n / where m^ is the match value between query q and document d i# -119- WEIGHT Syntax: WEIGHT FROM/(C) TO/(C) WITH/AK MULT/S; The structure FROM is a set of queries, TO has the same shape as FROM with weights taken from WITH. WITH is a single record in the form / 1 W^ Wo . . i W j / w i is then the weight for term i. If MULT is set, the w^ weight is multiplied into the term weight already in the FROM structure to get the new term weight. WEIGHTD Syntax: WEIGHTD FROM/A TO /A WITH/AK MULT/S; FROM refers to a document file. TO is a corresponding output file with new weights taken from WITH. WEIGHTD does for the document set exactly what WEIGHT does for the smaller query set. The distinction is that the document file cannot be an in-core structure. WRITE Syntax: WRITE FROM/(C) TO/CSYSOUT); The FROM structure is written to the dd TO in "traditional" form. Example of the use of CAMIR As an example, suppose that we want to get a listing of the 10 documents closest to each relevant document of query 37 in the NPL test collection, closeness being measured by adding up the i.d.f. weights for the common terms. To pull out the reldoc numbers for query 37, we do the following: SET MFP1 INPUT 1 37 / /* C .IR:BIN TO &S &S now contains a suitable WITH file for the command GETRECS. We run this -120- command using the following piece of Phoenix: MFP1.IMODS:L($S=200|0 %D='$40K250G$' %LIB=MFP1.IMODS FLIB=SYS1.F0RTLIB SYSPRINT=%M/N SYSIN=%H! /VBS U=&U/N QS=.VAS.XRELS S=&S) GETRECS T #U F O #QS WITH #S; O RM and t h i s can e q u a l l y be done by the command sequence: SET P IRCOM SET P IRENV C .IR:DO 'GETRECS TO #U FROM #QS WITH #S 'U=&U/N QS=.VAS.XRELS S=&S S i m i l a r l y , to pull out and print the r e l d o c s themselves, we do: SET P SET P C .IR C .IR PRINT I C M 'GETRECS T #R WITH //LIST RO O IRENV 'R .VAS.Q37DOCS/N LIST & U DO TXT FROM .VAS.Q37DOCS TO &A &A - where DOCS has been set up by a JCL line: //DOCS DD UNIT=TAPE9,... PARS has not been obeyed here, since the GETRECS clearly does not require the TCOUNT or DCOUNT values. Getting the set of closest 10 documents to each of the reldocs can now be done by //DOCS DD DSN=VAS.XDOCS.ABF,LABEL=2,VOL=SER=MFP106,UNIT=TAPE9 SET MFP1.VAS MFP1.iIM0DS:L( %D = '$35K250G$' $S=200|0 %LIB=MFP1. IMODS FLIB=SYS1.F0RTLIB HWEIGHTS=.XHARTER QS=.Q37DOCS MATC H =.Q37MATC H/N/VBS SYSPRINT=%M/N SYSIN %Hl PARS DCOUNT 11429 TCOUNT 7491; HWEIGHT FROM #QS WITH //HWEIGHTS MODE 2; DEL 9S HWEIGHTS; BMATCH TO //MATCH CUTOFF 10 QWEIGHTS; 1 C MFP1.IR:TXT F O .Q37DOCS T & RM O A PRINT & A -121- -122- REFERENCES 1. SPARCK JONES, K. and BATES, R.G., Research on automatic indexing 1974-1976, British Library Research and Development Report 5464. Computer Laboratory, Cambridge, 1977. 2. RICHARDS, M. et al., TRIPOS - a portable operating system for minicomputers. Computer Laboratory, Cambridge, 1979. 3. HARTER, S.P., A probabilistic approach to automatic keyword indexing. Journal of the ASIS, £6, 1975. 4. HARPER, D.J. and VAN RIJSBERGEN, C.J., An evaluation of feedback in document retrieval using co-occurrence data. Journal of Documentation, 34, 1978. 5. VAN RIJSBERGEN, C.J., Information retrieval. Second edition. Butterworths, London, 1979. •123-