CHAPTER 6 An algorithm for suffix stripping Many strategies for suffix stripping have been reported in the literature (see for example [1] to [6]). The nature of the task will vary considerably depending on whether a stem dictionary is being used, whether a suffix list is being used, and of course on the purpose for which the suffix stripping is being done. Assuming that one is not making use of a stem dictionary, and that the purpose of the task is to improve IR performance, the suffix stripping program will usually be given an explicit list of suffixes, and, with each suffix, the criterion under which it may be removed from a word to leave a valid stem. This is the approach adopted here. The main merits of the present program are that it is small (less than 400 lines of BCPL), fast (it will proces a vocabulary of 10,000 different words in about 8.1 seconds on the IBM 370/165 at Cambridge University), and reasonably simple. At any rate, it is simple enough to be described in full as an algorithm in this paper. Given the speed of the program, it would be quite realistic to apply it to every word in a large file of continuous text, although for historical reasons we have found it convenient to apply it only to relatively small vocabulary lists derived from continuous text files. In any suffix stripping program for IR work, two points must be borne in mind. Firstly, the suffixes are being removed simply to improve IR performance, and not as a linguistic exercise. This means that it would not be at all obvious under what circumstances a suffix should be removed, even if we could exactly determine the suffixes of the words by automatic means. Perhaps the best criterion for removing suffixes from two words W^ and W2 to produce a single stem S, is to say that we do so if there appears to be no difference between the two statements f a document is about W^ ' and f a document is about W 2 f . So if W 1 = 'CONNECTION1 and W 2 = 'CONNECTIONS' it seems very reasonable to conflate them to a single stem. But if W 1 = 'RELATE1 and W 2 = 'RELATIVITY' it seems perhaps unreasonable, especially if the document collection is concerned with theoretical physics. (It should perhaps be added that RELATE and RELATIVITY are conflated together in the algorithm described here.) Between these two extremes there is a continuum of different cases, and given two terms W^ and W 2 , there will be some variation in opinion as to whether they should be conflated, just as there is with deciding the relevance of some document to a query. The evaluation of the worth of a suffix stripping system is correspondingly difficult. The second point is that with the approach adopted here, i.e. the use of a suffix list with various rules, the success rate for the suffix stripping will be significantly less than 100%, irrespective of how the process is -98- evaluated. For example, if SAND and SANDER get conflated, so most probably will WAND and WANDER. The error here is that the -ER of WANDER has been treated as a suffix when in fact it is part of the stem. Equally a suffix may completely alter the meaning of a word, in which case its removal is unhelpful. PROBE and PROBATE for example, have quite distinct meanings in modern English. (In fact these would not be conflated in our present algorithm.) There comes a stage in the development of a suffix stripping program where the addition of more rules to increase the performance in one area of the vocabulary causes an equal degradation of performance elsewhere. Unless this phenomenon is noticed in time, it is very easy for the program to become much more complex than is really necessary. It is also easy to give undue emphasis to cases which appear to be important, but which turn out in practice to be rather rare. For example, cases in which the spelling of the root of the word changes with the addition of a suffix, as in DECEIVE/DECEPTION, RESUME/RESUMPTION, INDEX/INDICES, occur much more rarely in real vocabularies than one might at first suppose. In view of the error rate that must in any case be expected, it did not seem worthwhile to try and cope with these cases. It is not obvious that the simplicity of the present program is any demerit. In a test on the well known Cranfield 200 collection [7] it gave an improvement in retrieval performance when compared with a very much more elaborate program which has been in use in IR research at Cambridge since 1971 (described in [2] and [6]). The test was done as follows: the words of the titles and abstracts in the documents were passed through the earlier suffix stripping system, and the resulting stems were used to index the documents. The words of the queries were reduced to stems in the same way, and the documents were ranked for each query using term coordination matching of query against document. From these rankings, recall and precision values were obtained using the standard recall cutoff method. The entire process was then repeated using the suffix stripping system described in this paper, and the results were as follows: earlier system sion recall 0 57.24 10 56.85 20 52.85 30 42.61 40 40.20 50 39.06 60 32.86 70 31.64 80 27.15 90 24.59 100 24.59 present system precision recall 0 58.60 10 58.13 20 53.92 30 43.51 40 39.39 50 38.85 60 33.18 70 31.19 80 27.52 90 25.85 100 25.85 Clearly the performance is not very different. The important point is that the earlier, more elaborate system certainly performs no better than the -99- present, simple system. (This test was done by Prof. C.J. van Rijsbergen.) The algorithm To present the suffix stripping algorithm in its entirety we will need a few definitions. A consonant in a word is a letter other than A, E, than Y preceded by a consonant. (The fact that the defined to some extent in terms of itself does not TOY the consonants are T and Y, in SYZYGY they are is not a consonant it is a vowel. I, 0 term make S, Z and U, and other 'consonant1 is it ambiguous.) So in and G. If a letter A consonant will be denoted by c, a vowel by v. A list ccc... of length greater than 0 will be denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any word, or part of a word, therefore has one of the four forms: CVCV CVCV VCVC VCVC ... ... ... ... C V C V These may all be represented by the single form [C3VCVC ... [V] where the square brackets denote arbitrary presence of their contents. Using (VC) m to denote VC repeated m times, this may again be written as [C](VC)m[V]. m will be called the measure of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples: m=0 m=1 m=2 TR, EE, TREE, Y, BY. TROUBLE, OATS, TREES, IVY. TROUBLES, PRIVATE, OATEN, ORRERY. The rules for removing a suffix will be given in the form (condition) S1 -> S2 This means that if a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2. The condition is usually given in terms of m, e.g. -100- (m>1) EMENT -> Here S1 is ? EMENT 1 and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is word part for which m = 2. The 'condition' part may also contain the following: *S - the stem ends with S (and similarly for the other letters). *v* - the stem contains a vowel. *d *o - the stem ends with a double consonant (e.g. -TT, -SS). - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP). And the condition part may also contain expressions with and, or and not, so that (m>1 and (*S or *T)) tests for a stem with m>1 ending in S or T, while (*d and not (*L or *S or *Z)) tests for a stem ending with a double consonant other than L, S or Z. Elaborate conditions like this are required only very rarely. In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest matching S1 for the given word. For example, with SSES IES SS S - > SS -> I - > SS -> (here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for S1. Equally CARESS maps to CARESS (S1=fSS?) and CARES to CARE (S1r'S f ). In the rules below, examples of their application, successful or otherwise, are given on the right in lower case. The algorithm now follows: Step 1a SSES -> SS IES -> I caresses ponies -> -> caress poni -101- ss -> ss s -> Step 1b (m>0) EED -> EE (*v*) ED -> ties caress cats -> -> -> ti caress cat (*v*) ING -> feed agreed plastered bled motoring sing -> -> -> -> -> -> feed agree plaster bled motor sing If the second or third of the rules in Step 1b is successful, the following is done: AT BL IZ (*d -> ATE -> BLE -> IZE and not (*L or *S or *Z)) -> single letter conflat(ed) troubK ing) siz(ed) hopp(ing) tann(ed) fall(ing) hiss(ing) fizz(ed) fail(ing) fil(ing) -> -> -> -> -> -> -> -> -> -> conflate trouble size hop tan fall hiss fizz fail file (m=1 and *o) -> E The rule to map to a single letter causes the removal of one of the double letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes -ATE, -BLE and -IZE can be recognised later. This E may be removed in step 4. Step 1c (*v*) Y -> I happy sky -> -> happi sky Step 1 deals with plurals and past participles. The subsequent steps are much more straightforward. Step 2 (m>0) ATIONAL -> ATE (m>0) TIONAL -> TION (m>0) ENCI (m>0) ANCI (m>0) IZER -> ENCE -> ANCE -> IZE relational conditional rational valenci hesitanci digitizer -> -> -> -> -> -> relate condition rational valence hesitance digitize -102- (m>0) ABLI (m>0) ALLI (m>0) ENTLI (m>0) ELI (m>0) OUSLI (m>0) IZATION (m>0) ATION (m>0); ATOR (m>0) ALISM (m>0) IVENESS (m>0) FULNESS (m>0) OUSNESS (m>0) ALITI (m>0) IVITI (m>0) BILITI -> ABLE -> AL - > ENT -> E -> OUS - > IZE - > ATE -> ATE -> AL -> IVE -> FUL -> OUS - > AL - > IVE -> BLE conformabli radicalli differentli vileli analogousli vietnarnization predication operator feudalism decisiveness hopefulness callousness formaliti sensitiviti sensibiliti -> conformable -> radical -> different -> vile -> analogous -> vietnamize -> predicate -> operate -> feudal -> decisive -> hopeful -> callous -> formal -> sensitive -> sensible The test for the string S1 can be made fast by doing a program switch on the penultimate letter of the word being tested. This gives a fairly even breakdown of the possible values of the string S1. It will be seen in fact that the S1-strings in step 2 are presented here in the alphabetical order of their penultimate letter. Similar techniques may be applied in the other steps. Step 3 (m>0 (m>0) (m>0) (m>0) (m>0) (m>0) (m>0) Step 4 (m>1 AL -> (m>1 ANCE -> (m>1 ENCE -> (m>1 ER -> (m>1 IC -> (m>1 ABLE -> (m>1 IBLE -> (m>1 ANT -> (m>1 EMENT -> (m>1 MENT -> (m>1 ENT -> (m>1 and (*S or *T)) ION -> (m>1 OU -> (m>1 ISM -> revival allowance inference airliner gyroscopic adjustable defensible irritant replacement adjustment dependent adoption homologou communism > > > > > > > > > > > > > > reviv allow infer airlin gyroscop adjust defens irrit replac adjust depend adopt homolog commun ICATE ATIVE ALIZE ICITI ICAL FUL NESS -> -> -> -> -> -> -> IC AL IC IC triplicate formative formalize electriciti electrical hopeful goodness -> -> -> -> -> -> -> triplic form formal electric electric hope good -103- (m>1) (m>1) (m>1) (m>1) (m>1) ATE ITI OUS IVE IZE -> -> -> -> -> activate angulariti homologous effective bowdlerize -> -> -> -> -> activ angular homolog effect bowdler The suffixes are now removed. All that remains is a little tidying up. Step 5a (m>1) E -> (m=1 and not *o) E -> Step 5b probate rate cease -> -> -> probat rate ceas (m>1 and *d and *L) -> single letter controll roll -> -> control roll The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure, m. There is no linguistic basis for this approach, it was merely observed that m could be used quite effectively to help decide whether or not it was wise to take off a suffix. For example, in the following two lists: list A RELATE PROBATE CONFLATE PIRATE PRELATE list B DERIVATE ACTIVATE DEMOSTRATE NECESSITATE RENOVATE -ATE is removed from the list B words, but not from the list A words. This means that the pairs DERIVATE/DERIVE, ACTIVATE/ACTIVE, DEMONSTRATE/DEMONSTRABLE, NECESSITATE/NECESSITOUS, will conflate together. The fact that no attempt is made to identify prefixes can make the results look rather inconsistent. Thus PRELATE does not lose the -ATE, but ARCHPRELATE becomes ARCHPREL. In practice this does not matter too much, because the presence of the prefix decreases the probability of an erroneous conflation. Complex suffixes are removed bit by bit in the different steps. Thus GENERALIZATIONS is stripped to GENERALIZATION (Step 1 ) , then to GENERALIZE (Step 2 ) , then to GENERAL (Step 3 ) , and then to GENER (Step 4 ) . OSCILLATORS is stripped to OSCILLATOR (Step 1 ) , then to OSCILLATE (Step 2 ) , then to OSCILL (Step 4 ) , and then to OSCIL (Step 5 ) . In a vocabulary of 10,000 words, the reduction in size of the stem was distributed among the steps as follows: -104- Suffix stripping of a vocabulary of 10,000 words Number of words reduced in step 1 1: CM 3597 3: 4: 5: Number of words not reduced: 766 327 2424 1373 3650 The resulting vocabulary of stems contained 6370 distinct entries. Thus the suffix stripping process reduced the size of the vocabulary by about one third. -105- REFERENCES 1. LOVINS, J.B. Development of a Stemming Algorithm. Mecanical Translation and Computational Linguistics. 11 (1) March 1968 pp 22-31. ANDREWS, K. The Development of a Fast Conflation Algorithm for English. Dissertation for the Diploma in Computer Science, Computer Laboratory, University of Cambridge, 1971. PETRARCA, A.E. and LAY W.M. Use of an automatically generated authority list to eliminate scattering caused by some singular and plural main index terms. Procedings of the American Society for Information Science,6 1969 pp 277-282. DATTOLA, Robert T. FIRST: Flexible Information Retrieval System for Text. Xerox Corporation, Webster N.Y. 12 Dec 1975. COLOMBO, D.S. and NIEHOFF R.T. Final Report on Improved Access to Scientific and Technical Information through automated vocabulary switching. NSF Grant No. SIS75-12924 to the National Science Foundation. 2. 3. 4. 5. 6. DAWSON, J.L. Suffix Removal and Word Conflation. Michaelmas 1974 pp 33-46. ALLC Bulletin, 7. CLEVERDON, C.W., MILLS J. and KEEN M. Factors Determining the Performance of Indexing Systems 2 vols. College of Aeronautics, Cranfield 1966. -106-