CHAPTER 5

Programs for setting up document test collections

The following files are used in the system (and are not necessarily disc resident):

    MFP1.S.PROG     a pds of program texts.
    MFP1.S.RUN      a pds of phoenix commands to run each program.
    MFP1.S.LMOD     a load module library of programs.
    MFP1.S.TEXTS    a pds of data files used as defaults by the programs.

To invoke the members of MFP1.S.RUN as a private library, type

    LIBRARY MFP1.S.LMOD:COM0

Commands in the library work from a FROM file to a TO file, and the defaults for FROM and TO are %C and %0 respectively. TO is set up as /LARGE. (This is phoenix jargon, phoenix being the command language in use on the IBM 370/165 at Cambridge.)

STORE should be adjusted upwards from the usual default of 120 for online work when substantial sorting or data storage is required by the programs. When the programs do run out of store, the messages give a good indication of how much was really needed.

DECHAR

e.g.
    DECHAR FROM .VASWANI.SRCE1
    DECHAR FROM &S TO &T

This takes the original text of a standard test collection (FROM), and replaces all non-letters in the document abstracts by space. Lower case letters are forced to upper case. The syntax of the input is validated, in that document numbers must be 1, 2, 3, ... in turn, each document must be terminated with /, and the whole collection with a further /.

With the FROM file:

    1
    Compact memories have flexible capacities. A digital data storage
    system with capacity up to 24000 bits and random and or sequential
    access is described.
    /
    2
    An electronic analogue computer for solving systems of linear
    equations. Mathematical derivation of the operating principle and
    stability conditions for a computer consisting of amplifiers.
    /
    100
    Satellite observations of electrons artificially injected into the
    geomagnetic field. The geomagnetically trapped electrons resulting
    from the high altitude nuclear detonations of the ARGUS experiment
    have been observed on four radiation detectors in satellite explorer.
    The measurements for several satellite passes through the ARGUS
    shells are described and the significance of the results is
    summarized.
    /
    /

DECHAR produces the TO file:

    1
    COMPACT MEMORIES HAVE FLEXIBLE CAPACITIES A DIGITAL DATA STORAGE
    SYSTEM WITH CAPACITY UP TO BITS AND RANDOM AND OR SEQUENTIAL
    ACCESS IS DESCRIBED
    /
    2
    AN ELECTRONIC ANALOGUE COMPUTER FOR SOLVING SYSTEMS OF LINEAR
    EQUATIONS MATHEMATICAL DERIVATION OF THE OPERATING PRINCIPLE AND
    STABILITY CONDITIONS FOR A COMPUTER CONSISTING OF AMPLIFIERS
    /
    100
    SATELLITE OBSERVATIONS OF ELECTRONS ARTIFICIALLY INJECTED INTO THE
    GEOMAGNETIC FIELD THE GEOMAGNETICALLY TRAPPED ELECTRONS RESULTING
    FROM THE HIGH ALTITUDE NUCLEAR DETONATIONS OF THE ARGUS EXPERIMENT
    HAVE BEEN OBSERVED ON FOUR RADIATION DETECTORS IN SATELLITE EXPLORER
    THE MEASUREMENTS FOR SEVERAL SATELLITE PASSES THROUGH THE ARGUS
    SHELLS ARE DESCRIBED AND THE SIGNIFICANCE OF THE RESULTS IS
    SUMMARIZED
    /
    /

DESTOP

e.g.
    DESTOP WITH .OWN:STOPLIST
    DESTOP FROM &T TO .VAS.AO

This removes from the text of FROM the words supplied in a given stop list (WITH), together with words consisting of only one or two letters. The stop list may contain one or two letter words, but these are redundant. (Here and below, words are defined as upper case letter sequences bounded by non-letters.)
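The two text-cleaning steps described so far can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the original programs: the function names `dechar` and `destop` are borrowed from the commands, the functions work on a single abstract rather than on a whole collection file, and the small stop list is hypothetical (A, AND and HAVE occur in the excerpt of the default list shown in this section; WITH is assumed to occur in the part not shown).

```python
import re


def dechar(abstract: str) -> str:
    """Replace every non-letter by a space and force letters to upper case."""
    return re.sub(r"[^A-Za-z]", " ", abstract).upper()


def destop(text: str, stop_words: set[str]) -> str:
    """Drop stop-list words and any word of one or two letters."""
    # A word is an upper case letter sequence bounded by non-letters.
    words = re.findall(r"[A-Z]+", text)
    return " ".join(w for w in words if len(w) > 2 and w not in stop_words)


abstract = (
    "Compact memories have flexible capacities. A digital data "
    "storage system with capacity up to 24000 bits and random "
    "and or sequential access is described."
)

# Hypothetical stop list for the illustration (see the note above).
stop_words = {"A", "AND", "HAVE", "WITH"}

cleaned = destop(dechar(abstract), stop_words)
print(cleaned)
```

Under these assumptions the result for document 1 agrees with the DESTOP output given in this section. Short words such as UP, TO, OR and IS disappear through the length rule alone, which is why one or two letter entries in the stop list are redundant.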
With the FROM file:

    1
    COMPACT MEMORIES HAVE FLEXIBLE CAPACITIES A DIGITAL DATA STORAGE
    SYSTEM WITH CAPACITY UP TO BITS AND RANDOM AND OR SEQUENTIAL
    ACCESS IS DESCRIBED
    /
    2
    AN ELECTRONIC ANALOGUE COMPUTER FOR SOLVING SYSTEMS OF LINEAR
    EQUATIONS MATHEMATICAL DERIVATION OF THE OPERATING PRINCIPLE AND
    STABILITY CONDITIONS FOR A COMPUTER CONSISTING OF AMPLIFIERS
    /
    100
    SATELLITE OBSERVATIONS OF ELECTRONS ARTIFICIALLY INJECTED INTO THE
    GEOMAGNETIC FIELD THE GEOMAGNETICALLY TRAPPED ELECTRONS RESULTING
    FROM THE HIGH ALTITUDE NUCLEAR DETONATIONS OF THE ARGUS EXPERIMENT
    HAVE BEEN OBSERVED ON FOUR RADIATION DETECTORS IN SATELLITE EXPLORER
    THE MEASUREMENTS FOR SEVERAL SATELLITE PASSES THROUGH THE ARGUS
    SHELLS ARE DESCRIBED AND THE SIGNIFICANCE OF THE RESULTS IS
    SUMMARIZED
    /
    /

and the WITH file:

    A, ABOUT, ABOVE, ACROSS, AFTER, AFTERWARDS, AGAIN
    AGAINST, ALL, ALMOST, ALONE, ALONG, ALREADY, ALSO
    ALTHOUGH, ALWAYS, AMONG, AMONGST, AN, AND, ANOTHER
    ANY, ANYHOW, ANYONE, ANYTHING, ANYWHERE, ARE, AROUND
    AS, AT, BE, BECAME, BECAUSE, BECOME, BECOMES
    BECOMING, BEEN, BEFORE, BEFOREHAND, BEHIND, BEING, BELOW
    BESIDE, BESIDES, BETWEEN, BEYOND, BOTH, BUT, BY
    CAN, CANNOT, CO, COULD, DOWN, DURING, EACH
    EG, EITHER, ELSE, ELSEWHERE, ENOUGH, ETC, EVEN
    EVER, EVERY, EVERYONE, EVERYTHING, EVERYWHERE, EXCEPT, FEW
    FIRST, FOR, FORMER, FORMERLY, FROM, FURTHER, HAD
    HAS, HAVE, HE, HENCE, HER, HERE, HEREAFTER
    HEREBY, HEREIN, HEREUPON, HERS, HERSELF, HIM, HIMSELF

DESTOP produces as TO file:

    1
    COMPACT MEMORIES FLEXIBLE CAPACITIES DIGITAL DATA STORAGE SYSTEM
    CAPACITY BITS RANDOM SEQUENTIAL ACCESS DESCRIBED
    /
    2
    ELECTRONIC ANALOGUE COMPUTER SOLVING SYSTEMS LINEAR EQUATIONS
    MATHEMATICAL DERIVATION OPERATING PRINCIPLE STABILITY CONDITIONS
    COMPUTER CONSISTING AMPLIFIERS
    /
    100
    SATELLITE OBSERVATIONS ELECTRONS ARTIFICIALLY INJECTED GEOMAGNETIC
    FIELD GEOMAGNETICALLY TRAPPED ELECTRONS RESULTING HIGH ALTITUDE
    NUCLEAR DETONATIONS ARGUS EXPERIMENT OBSERVED FOUR RADIATION
    DETECTORS SATELLITE EXPLORER MEASUREMENTS SATELLITE
    PASSES ARGUS SHELLS DESCRIBED SIGNIFICANCE RESULTS SUMMARIZED
    /
    /

The default value of WITH is MFP1.S.TEXTS:STOPLIST

VOCAB

e.g.
    VOCAB FROM .VAS.AO TO .VAS.VOCAB STORE MOO
    VOCAB TO &VOC

This sorts the words in FROM and sends to TO a vocabulary list in the form:

    ABSORPTION
    ACCELERATION
    ACCESS
    ACCOMPANIED
    ACCOUNT
    ACCURACY
    ACHIEVED
    ACOUSTIC
    ADDER
    ADJUSTABLE
    ADVERSELY
    AFFECT
    AFFECTING
    AGREEMENT
    AIR
    ALLEN
    ALTITUDE
    AMERICAN
    AMMONIA
    AMPLIFICATION

STEM

e.g.
    STEM FROM &VOC TO &SVOC

STEM is a simple suffix stripping program. The output has the form:

    ABSORPT                 ABSORPTION
    ACCELER                 ACCELERATION
    ACCESS                  ACCESS
    ACCOMPAN                ACCOMPANIED
    ACCOUNT                 ACCOUNT
    ACCURAC                 ACCURACY
    ACHIEV                  ACHIEVED
    ACOUST                  ACOUSTIC
    ADDER                   ADDER
    ADJUST                  ADJUSTABLE
    ADVERS                  ADVERSELY
    AFFECT                  AFFECT
    AFFECT                  AFFECTING
    AGREEMENT               AGREEMENT
    AIR                     AIR
    ALLEN                   ALLEN
    ALTITUD                 ALTITUDE
    AMERICAN                AMERICAN
    AMMONIA                 AMMONIA
    AMPLIF                  AMPLIFICATION
    AMPLIF                  AMPLIFIER
    AMPLIF                  AMPLIFIERS
    AMPLITUD                AMPLITUDE
    ANALOGU                 ANALOGUE
    ANALYS                  ANALYSED
    ANALYS                  ANALYSER
    ANALYS                  ANALYSIS

Each word in the FROM file is output on a separate line, preceded by a character string derived from the word, which is used to bring words together into conflation groups. The character string is printed in a field of width 24, and has a maximum size of 22 characters. (If necessary, characters are chopped out of the middle to bring it down to size.) The TO file should be sorted to bring the words into their conflation groups,

e.g.
    STEM FROM &VOC SM

DICMAT

e.g.
    DESLASH FROM .VASVOC.TERMS TO &U
    DICMAT FROM .VAS.AO TO .VAS.BO WITH &U
    DICMAT WITH .VTERMS STORE 250

The WITH file (the TO output from TERMNOS) consists of every word in the text of the FROM file, arranged in alphabetical order, and followed by an integer. DICMAT replaces each word in the FROM file by its corresponding integer, and sends the result to TO. Words not in WITH but in FROM are printed out as not being in the dictionary, but are otherwise ignored.
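The replacement step can be illustrated with a minimal Python sketch. It is a reading of the description above rather than the DICMAT program itself: `read_dictionary` and `dicmat` are hypothetical helper names, the dictionary is assumed to hold one "WORD integer" pair per line, and unknown words are collected in a list instead of being printed.

```python
def read_dictionary(lines):
    """Parse 'WORD integer' lines (the WITH file) into a word -> integer map."""
    dictionary = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            dictionary[parts[0]] = int(parts[1])
    return dictionary


def dicmat(words, dictionary):
    """Replace each word by its integer; report words that are not listed."""
    codes, missing = [], []
    for word in words:
        if word in dictionary:
            codes.append(dictionary[word])
        else:
            missing.append(word)  # reported, but otherwise ignored
    return codes, missing


# A few entries in the style of the WITH file; words in the same
# conflation group (AFFECT, AFFECTING) share one integer.
with_lines = ["ACCESS 3", "AFFECT 12", "AFFECTING 12", "ALTITUDE 16"]
dictionary = read_dictionary(with_lines)

codes, missing = dicmat(["ACCESS", "FONT", "AFFECTING", "ALTITUDE"], dictionary)
print(codes)    # integers for the words that were found
print(missing)  # words not in the dictionary
```

Because conflated words carry the same integer, the output no longer distinguishes between members of a conflation group.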
With the FROM file:

    1
    COMPACT MEMORIES FLEXIBLE CAPACITIES DIGITAL DATA STORAGE SYSTEM
    CAPACITY BITS RANDOM SEQUENTIAL ACCESS DESCRIBED
    /
    2
    ELECTRONIC ANALOGUE COMPUTER SOLVING SYSTEMS LINEAR EQUATIONS
    MATHEMATICAL DERIVATION OPERATING PRINCIPLE STABILITY CONDITIONS
    COMPUTER CONSISTING AMPLIFIERS
    /
    ....
    100
    SATELLITE OBSERVATIONS ELECTRONS ARTIFICIALLY INJECTED GEOMAGNETIC
    FIELD GEOMAGNETICALLY TRAPPED ELECTRONS RESULTING HIGH ALTITUDE
    NUCLEAR DETONATIONS ARGUS EXPERIMENT OBSERVED FOUR RADIATION
    DETECTORS SATELLITE EXPLORER MEASUREMENTS SATELLITE PASSES ARGUS
    SHELLS DESCRIBED SIGNIFICANCE RESULTS SUMMARIZED
    /
    /

and the WITH file:

    ABSORPTION       1
    ACCELERATION     2
    ACCESS           3
    ACCOMPANIED      4
    ACCOUNT          5
    ACCURACY         6
    ACHIEVED         7
    ACOUSTIC         8
    ADDER            9
    ADJUSTABLE      10
    ADVERSELY       11
    AFFECT          12
    AFFECTING       12
    AGREEMENT       13
    AIR             14
    ALLEN           15
    ALTITUDE        16
    AMERICAN        17
    AMMONIA         18
    AMPLIFICATION   19
    ....

DICMAT produces as TO file:

    1 105 424 268 82 189 161 687 708 82 63 566 630 3 174 /
    2 220 20 113 664 708 388 230 415 173 479 538 683 117 113 128 19 /
    100 613 471 220 37 340
    294 263 294 737 220 600
    317 16 468 180 36 243
    471 280 561 178 613 245
    422 613 500 36 637
    174 646 600 699 /

BSORT

e.g.
    BSORT OPT SF
    BSORT FROM .VAS.BO TO .VAS.B1 OPT SN

This takes a file in ab-form (FROM), and adjusts the b's for each a. 'ab-form' means

    a b b ... b /
    a b b ... b /
    ....
    a b b ... b /

where the a's and b's are integers. The OPT parameter may contain S, F, B and N.

    S   causes the b's to be sorted in ascending order.
    N   causes duplicate b's (i.e. a b equal to the previous b) to be
        discarded.
    F   causes a list of equal b's to be replaced by a single b with a
        frequency count in the output.
    B   causes the output to be "brief", that is, multiple spaces are
        reduced to a single space. This is recommended for very large
        data collections.

The N and F options take effect after sorting. N and F together cause all the frequency counts to be 1.
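The effect of the S, N and F options on the b's of one a can be sketched as follows. This is a minimal Python illustration of the rules just stated, not the BSORT program: `bsort` is a hypothetical function that handles a single list of b's, and the B option is left out because it only affects output spacing.

```python
from itertools import groupby


def bsort(bs, opt):
    """Apply the S, N and F options to the list of b's for one a."""
    if "S" in opt:
        bs = sorted(bs)  # S: ascending order
    if "N" in opt:
        # N: discard any b equal to the previous b
        bs = [b for b, _ in groupby(bs)]
    if "F" in opt:
        # F: one (b, frequency) pair per run of equal b's
        return [(b, len(list(run))) for b, run in groupby(bs)]
    return bs


# The b's of document 1 from the DICMAT output.
doc1 = [105, 424, 268, 82, 189, 161, 687, 708, 82, 63, 566, 630, 3, 174]

print(bsort(doc1, "S"))    # sorted, duplicate 82 kept
print(bsort(doc1, "SN"))   # sorted, duplicate 82 discarded
print(bsort(doc1, "SF"))   # sorted, with counts; 82 has count 2
print(bsort(doc1, "SNF"))  # every count is 1
```

Since N only compares a b with its predecessor, it removes all duplicates only when the list is already sorted or S is also given; without S, an unsorted list such as 82 63 82 is left unchanged.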
So if the FROM file is:

    1 105 424 268 82 189 161 687 708 82 63 566 630 3 174 /
    2 220 20 113 664 708 388 230 415 173 479 538 683 117 113 128 19 /
    100 613 471 220 37 340
    294 263 294 737 220 600
    317 16 468 180 36 243
    471 280 561 178 613 245
    422 613 500 36 637
    174 646 600 699 /

OPT=S produces the output:

    1 3 63 82 82 105 161 174 189 268 424
    566 630 687 708 /
    2 19 20 113 113 117 128 173 220 230 388
    415 479 538 664 683 708 /
    100 16 36 36 37 174 178 180 220 220 243
    245 263 280 294 294 317 340 422 468 471
    471 500 561 600 600 613 613 613 637 646
    699 737 /

OPT=SF produces the output:

    1 3 1 63 1 82 2 105 1 161 1
    174 1 189 1 268 1 424 1 566 1
    630 1 687 1 708 1 /
    2 19 1 20 1 113 2 117 1 128 1
    173 1 220 1 230 1 388 1 415 1
    479 1 538 1 664 1 683 1 708 1 /
    100 16 1 36 2 37 1 174 1 178 1
    180 1 220 2 243 1 245 1 263 1
    280 1 294 2 317 1 340 1 422 1
    468 1 471 2 500 1 561 1 600 2
    613 3 637 1 646 1 699 1 737 1 /

    -> -> -> DISC* INTERESTED NUMERICAL REST***
    .TERMNOS, and 300K the words DISC, INTEREST, NUMERI DICMAT was rerun.)

    DICMAT WITH .TERMNOS
    FONT not in dict
    TRANSISTORISED not in dict
    SEND not in dict
    SEND not in dict
    OPTIMISING not in dict
    PRETREATMENT not in dict
    WISH not in dict
    WISH not in dict
    TRANSISTORISED not in dict
    300K
    732 words read
    440 (machine) words of unused workspace
    BMAP WITH .TERMRANK
    7491 terms read from WITH file
    Maximum term was 7491
    100 docs read
    BSORT TO .QUERIES.AB OPT SN
    100 units read
    130K

Appendix A. Other useful programs.

SCALE

e.g.
    SCALE TO &B AND &SCALE

This takes a file in ab-form (FROM) and produces on TO a version of the FROM file with the a's forming a simple ascending sequence 1, 2, 3, ... The AND file (which must be present) contains a simple list of the original a's. This file can be used for mapping b's with the BMAP program,

e.g.
    SCALE FROM .TERMS.A1 TO .TERMS.B1 AND &SCALE
    BMAP FROM .DOCS.A1 TO .DOCS.B1 WITH &SCALE STORE 190
    BMAP FROM .QS.A1 TO .QS.B1 WITH &SCALE STORE 190

RELS

e.g.
    RELS FROM .A TO .B

This puts the text form of a test collection (the output of DECHAR) into 3-tuple relation form for input to the CODD database. The columns are:

    DOCNO WORDNO WORD

where WORD is the WORDNOth word in the document with number DOCNO.
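The two appendix programs can be pictured with a short Python sketch. It rests on assumptions that the text does not settle: the function names `scale` and `rels` are hypothetical, records are held in memory as (number, content) pairs rather than read from files, SCALE is assumed to renumber in file order, and WORDNO is assumed to count from 1.

```python
def scale(records):
    """Renumber the a's as 1, 2, 3, ... and keep a list of the original a's.

    `records` is a list of (a, bs) pairs in ab-form order.
    Returns (renumbered records, original a's), i.e. the TO and AND outputs.
    """
    renumbered = [(i, bs) for i, (_, bs) in enumerate(records, start=1)]
    originals = [a for a, _ in records]
    return renumbered, originals


def rels(documents):
    """Turn (DOCNO, text) pairs into (DOCNO, WORDNO, WORD) 3-tuples."""
    rows = []
    for docno, text in documents:
        for wordno, word in enumerate(text.split(), start=1):
            rows.append((docno, wordno, word))
    return rows


to_file, and_file = scale([(7, [1, 2]), (12, [3])])
print(to_file)   # a's are now 1 and 2
print(and_file)  # the original a's, 7 and 12

for row in rels([(1, "COMPACT MEMORIES HAVE"), (2, "AN ELECTRONIC")]):
    print(*row)
```

The list returned as `and_file` plays the part of the AND file: position i holds the original a that became i, which is the information a mapping step such as BMAP would need.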