3 Stemming and truncation

3.1 Introduction

One way of broadening searches in information retrieval is to use systematic abbreviation of words so as to bring together words which are morphologically related, in the hope that they will also be semantically related. This can be done manually or automatically.

Information retrieval intermediaries use manual truncation to conflate words which are both morphologically and semantically related. They use their linguistic knowledge to avoid drawing in words which might seriously decrease the precision of the search. Truncation is often combined with boolean OR to bring in other synonyms. For example, the concept "communism" might be submitted to an IR system as "communis* OR marxis*" rather than as "commun* OR marx*", which would lead to the retrieval of records indexed under "communication" etc. Manual truncation is not a particularly easy or natural skill to acquire, and cannot be considered for casual catalogue users. It is a facility which can be provided for those, such as library staff, who wish to learn how to use it. Many of the "keyword" type online catalogue systems do allow for manual truncation [1].

The discussion here is limited to processes of automatic abbreviation or truncation which aim to conflate related words by reducing them to their stems. The word-segments to be removed are referred to as affixes. An affix can be a suffix (like "ation") or a prefix (like "pre"). Many prefixes cannot safely be removed except in narrowly defined subject areas (such as chemical terminology), because they tend to have a more drastic effect on the meaning of a word than do suffixes, which are often inflexional.

The next three sections form a fairly condensed survey of some of the stemming techniques which have been published. They are of a rather technical nature. Readers who are primarily interested in online catalogues might prefer to skip to Section 3.5.
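The manual truncation example given earlier in this section ("communis* OR marxis*" against "commun* OR marx*") can be made concrete with a small sketch. It is only an illustration: the function name, the query syntax handling and the seven-word vocabulary are assumptions made for the example, not a description of any particular IR system.

```python
def expand_truncated_query(query, vocabulary):
    """Return the index terms matched by a query such as 'communis* OR marxis*'.

    A trailing '*' marks right-hand truncation; terms are joined by ' OR '.
    (Hypothetical helper, written for this illustration only.)
    """
    matched = set()
    for term in query.lower().split(" or "):
        term = term.strip()
        if term.endswith("*"):
            prefix = term[:-1]
            # Truncated term: every index word beginning with the prefix matches.
            matched.update(word for word in vocabulary if word.startswith(prefix))
        elif term in vocabulary:
            matched.add(term)
    return sorted(matched)


# A toy index vocabulary (assumed for the example).
VOCABULARY = {
    "communism", "communist", "communication", "community",
    "marxism", "marxist", "marx",
}

narrow = expand_truncated_query("communis* OR marxis*", VOCABULARY)
broad = expand_truncated_query("commun* OR marx*", VOCABULARY)
```

With this vocabulary the narrower query matches only the four terms concerned with communism and marxism, while the broader query also draws in "communication" and "community", which is the loss of precision that a skilled intermediary tries to avoid.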

3.2 Methods and techniques in algorithm construction

Many algorithms have been reported. Some of the ways in which they differ are outlined in the following sections.


3.2.1 Iterative or longest match

An algorithm can be iterative, it can use a longest match method, or it can use a mixture of both.

The iterative method removes suffixes one by one from the end. (Sometimes single letters or groups of letters are removed rather than true suffixes, but the same principle of gradual reduction still applies.) For example, Porter's iterative algorithm [2] processes the word "conformability" in three iterations: at the first pass, "y" would be replaced by "i"; at the second iteration "biliti" would become "ble"; finally, "able" would be removed, leaving the stem "conform". The longest match method removes the longest matching suffix in one iteration. In the above example, the longest suffix ("ability") would be removed in one step.

3.2.2 Conditional rules

These include rules which prevent the removal of a given affix or class of affixes if a given condition is not satisfied. They commonly involve a minimum length condition: a rule of the form "remove a given suffix unless the result would be less than four characters long" is an example. Some stemming procedures have word-specific rules like "remove "ed" unless the word is "United"".

3.2.3 Stem modification

Particularly in iterative procedures, a better conflation can be achieved by rewriting some stems. For example, terminal "y" may be replaced by "i" to conflate word forms ending in "y" with other related words: this rule would apply to the words "pony" (to retrieve "ponies" when the "es" suffix is removed) or "happy" (to retrieve "happiness" when the "ness" suffix is removed). Another example is the changing of "pt" to "b" to conflate word forms ending in "b" with grammatically related words which change the "b" to "pt": this rule would apply to the word "absorption" (to retrieve "absorb" and "absorbent").

3.2.4 Compilation of the suffix list

The differences above correspond to differences in the structure of the algorithm. All algorithms, whether iterative or longest match, need a list or dictionary of suffixes. Dictionaries can be constructed manually, or they can be generated automatically or semi-automatically from bodies of text. Their size and scope will strongly influence the behaviour of the algorithm. Lennon and his colleagues [3] suggest that while manual lists of possible suffixes give results of a very high quality, the length of time taken often makes this method impractical. In their evaluation, they test a method (using the frequency of word endings) for the automatic generation of possible suffixes. They concluded that "fully automated methods perform as well as procedures which involve a large degree of manual involvement in their development" (p183).

3.2.5 Users' needs

The main function of conflation algorithms must be to improve recall; there will always be some searches where there is a loss of precision. The balance between recall and precision must be chosen to suit different classes of users. An industrial user of a retrieval system who needs a comprehensive search might be prepared to examine a substantial proportion of irrelevant material (caused by overstemming). For general library catalogue use, on the other hand, understemming is to be preferred to overstemming.

3.3 Conflation algorithms: a review

3.3.1 INTREX

One of the first conflation algorithms to be developed and tested was part of Project INTREX (see Overhage and Reintjes [4] for a general review). Lovins [5], who participated in this project, produced a list of suffixes by first examining a preliminary list generated from the endings of words from the Project INTREX catalogue. The list was used to see when the use of a given ending from the word in the dictionary would result in a mismatch, or in the omission of a stem which ought to match. This manual assessment allowed the author to refine the list of endings and to compile word specification and recoding rules. The final list contained around 260 suffixes. It was used in conjunction with both context rules and recoding rules.

3.3.2 RADCOL

Lowe and others [6] tested two algorithms as part of the RADCOL project. The first used two passes through a list of 95 suffixes; the second used a single pass longest match algorithm with a longer list of 570 suffixes. After tests, the second algorithm was adopted. Lowe and his colleagues obtained this list by a multi-stage process. First the characters of the most frequent words in the index were reversed, and the reversed words were sorted in alphabetical order. Then a minimum string length was established. The list was scanned for repeated character strings, on the assumption that strings which occurred with more than a certain frequency might be possible suffixes. Finally, these character strings were examined manually and suffixes were selected from them. The comprehensiveness of this suffix list meant that the number of context and recoding rules could be reduced, increasing the simplicity of the algorithm.
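The automatic part of the RADCOL list-building process (reverse the words, sort them, set a minimum string length, and look for endings which recur often enough) might be sketched roughly as follows. The function name, the length limits, the frequency threshold and the sample words are all assumptions chosen for illustration; the final manual selection step is not modelled.

```python
from collections import Counter


def candidate_suffixes(words, min_length=3, max_length=6, min_count=3):
    """Return (ending, count) pairs for word endings that recur frequently.

    Hypothetical sketch: thresholds are illustrative, not those of RADCOL.
    """
    # Reversing and sorting brings words with shared endings together;
    # the sort mirrors the described procedure but is not needed for counting.
    reversed_words = sorted(word[::-1] for word in words)
    counts = Counter()
    for reversed_word in reversed_words:
        # Leave at least one character of stem in front of any candidate ending.
        longest = min(max_length, len(reversed_word) - 1)
        for n in range(min_length, longest + 1):
            counts[reversed_word[:n]] += 1
    return sorted(
        (ending[::-1], count)
        for ending, count in counts.items()
        if count >= min_count
    )


SAMPLE_WORDS = [
    "relation", "creation", "nation", "station",
    "happiness", "kindness", "darkness", "walking",
]
candidates = candidate_suffixes(SAMPLE_WORDS)
```

On this small sample the candidates are "ation", "ion" and "tion" (four occurrences each) and "ess" and "ness" (three each). "ing" occurs only once and is rejected, and the spurious "ess" shows why a manual examination of the candidate strings was still needed.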
3.3.3 Generation of suffix lists

Lennon and his colleagues, in the course of their evaluation of conflation algorithms [3], extended the method used in the RADCOL project and in the INSPEC project (discussed below). A list of reversed words was used to produce a list of word endings which occurred with more than a certain frequency. These can be assumed to be suffixes, although if this algorithm is to be used automatically no recoding is possible. This is a simple but unwieldy approach. Since it operates as a longest match algorithm, the inclusion of a string "iveness" in a suffix dictionary also necessitates the inclusion of the substrings "veness", "eness", "ness" and so on. The proportion of strings which have a real utility is therefore reduced.

This method was also used by Tarry [7] to generate several sets of equifrequent character strings from the ends of words. The method involves selecting from a body of text character strings of variable length occurring with approximately equal frequencies and with low sequential dependence. This suffix generation procedure can be used for automatically determining subject-specific or language-specific lists of suffixes. The incentive for using this technique for suffix generation was the supposition that character strings representing suffixes would occur more frequently than other terminal character strings. As well as this, it has been observed that letter dependency within words decreases at the boundaries of word units such as affixes. Tarry's algorithm works on the longest match principle and has no restriction on suffix removal other than that the remaining stem should be of a minimum length. Since there is no restriction on removal, the algorithm is context-free and uses neither recoding nor partial matching. Tarry justifies this approach by the desirability of eliminating "the large amount of manual preprocessing required, both in the construction of the suffix lists, and in the formulation of the suffix removal rules" [7, p21].
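A context-free longest match procedure of the kind attributed to Tarry, whose only restriction is a minimum stem length, can be sketched in a few lines. This is a minimal illustration rather than Tarry's own code: the suffix list and the minimum stem length of three characters are assumptions.

```python
def longest_match_stem(word, suffixes, min_stem_length=3):
    """Remove the longest listed suffix that leaves a long enough stem.

    No context rules, recoding or partial matching are applied.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= min_stem_length:
            return word[:stem_length]
    return word  # no suffix could be removed


# Hypothetical suffix list, for illustration only.
SUFFIXES = {"iveness", "ness", "ive", "ing", "s"}

effect = longest_match_stem("effectiveness", SUFFIXES)
kind = longest_match_stem("kindness", SUFFIXES)
sing = longest_match_stem("sing", SUFFIXES)
```

Here "effectiveness" is reduced to "effect" in a single step and "kindness" to "kind", while "sing" is left alone because removing "ing" would leave a stem of only one character.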
This algorithm was compared with the INSPEC algorithm; retrieval tests with the Cranfield 1400 test collection were made and it was found that the algorithm performed at least as well as the traditional algorithm [7, p76].

3.3.4 INSPEC

A conflation algorithm was designed by Field [8] at INSPEC with British Library funding. The list of suffixes was compiled manually after consulting a Key Letter In Context (KLIC) index. This algorithm uses a mixture of longest match and iterative suffix removal and incorporates several features which were designed to improve its effectiveness: minimum stem length, recoding rules and three stage conflation. This last application is particularly interesting. The word to be conflated is first dealt with by Algorithm 0, which removes stop words and common endings such as plural forms (this stage is partially iterative). Words which are not stopped are then treated by Algorithm 1, which removes all other suffixes which are present in a longest match routine. In a final stage, Algorithm 2 makes adjustments to the stem, usually on the basis of stem length. Field claims that this use of a three stage process increases the overall efficiency.
3.3.5 Stemming in SMART and FIRST

The IR systems used in the SMART projects incorporated stemming. The SMART system bases all dictionaries on word stems rather than original words. The suffixes which generate the word stems are listed in a suffix dictionary, and each one carries one or more syntactic codes. These must be matched with complementing codes attached to the word stems in order to determine which suffixes match which stems [9, p32]. Dattola [10] has described FIRST, the Flexible Information Retrieval System for Text, which is based on the methods developed during the SMART project. The most important part of this procedure is a stem dictionary; this is the basis of the conflation procedure. Words are added to the stem dictionary if they fail to match an existing stem and are more than three characters long. This method uses a stem dictionary of whole words rather than actual stems; new words are not added to the stem dictionary if they are suffix variations of existing stem entries.
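The way a stem dictionary of whole words grows, as described for FIRST, might be pictured with the following sketch. It is an assumption-laden illustration, not Dattola's procedure: the suffix list is invented, words are taken in order of arrival, and a "suffix variation" is taken to mean an existing entry followed by one listed suffix.

```python
# Hypothetical suffix list, for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")


def build_stem_dictionary(words, suffixes=SUFFIXES, min_length=4):
    """Build a dictionary of whole-word stems from a stream of words.

    Returns the list of stem entries and a mapping from each word to its entry.
    """
    stems = []        # whole words acting as stems, in order of arrival
    assignments = {}  # word -> stem entry it was conflated with
    for word in words:
        match = next(
            (stem for stem in stems
             if word == stem or any(word == stem + suffix for suffix in suffixes)),
            None,
        )
        if match is not None:
            # A suffix variation of an existing entry: not added as a new stem.
            assignments[word] = match
        elif len(word) >= min_length:
            # No existing stem matches and the word has more than three characters.
            stems.append(word)
            assignments[word] = word
    return stems, assignments


stems, assignments = build_stem_dictionary(
    ["index", "indexes", "indexing", "map", "search", "searched"]
)
```

In this run only "index" and "search" become dictionary entries; "indexes", "indexing" and "searched" are attached to them, and the three-letter word "map" is never entered. One consequence of the design is that the result depends on the order in which words arrive.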
3.3.6 MORPHS

Bell and Jones have described the retrieval system MORPHS in a number of articles including [11]. This system (the name means "Minicomputer Operated Retrieval (Partially Heuristic) System") is used at the Malaysian Rubber Producers' Research Association. It incorporates automatic stemming. Bell and Jones [12] discussed the use of roles and stemming in an earlier version of the system as a means of improving recall and also incorporating some syntactic knowledge. They believed that the two techniques could be combined by replacing the suffixes by a limited number of role indicators. In this system stemming was performed manually by the searcher who could either, as in their example, search for MIX; or MIX (role A) to include MIXING; or MIX (role D) to include MIXED.

An extensive suffix list is used; its size is increased by its treatment of exceptions (the stems "cation" and "station" are included and are used in preference to the stem "ion") and by the inclusion of [...] suffixes. The system attempts to guard against the removal of apparent affix strings ("pre" in "pressure", for example) by checking that the stem is present in the main file before the affix is removed. A minimum stem length also helps in protecting against the removal of apparent affixes. Bell and Jones recount how they were puzzled by the generation of the stem "im" until they discovered that the prefix "an" and the suffix "al" had been stripped from the word "animal". Both longest match and iterative methods are used in the removal of suffixes and the creation of stem dictionaries.

3.3.7 Cercone and language analysis

Some algorithms have been developed and tested for natural language applications. The morphological algorithm of Cercone [13] aims to determine the root of a term by removing suffixes and prefixes. Affixes are removed iteratively by consulting an affix dictionary. This program uses a system of order classes, assuming that affixes are added to the stem in a certain order (this particularly applies to suffixes). By using these order classes a word which is to be stemmed can be examined for the presence of affixes belonging to each class in turn. The removal of some affixes is followed by recoding of the root. After recoding, the root dictionary is searched and, if a match is made, the root and the affixes are output. If there is no match the next order class of affixes is scanned and the process continues until a root is found. Cercone's algorithm was designed for use in textual analysis and relies on the manual construction of lists and rules. It is discussed by Cercone [14] as an element of morphological analysis and lexicon design for natural language processing.

3.3.8 MARS

A different approach has been taken by research workers at Siemens who have developed a system called MARS [15]. This system uses a morpheme lexicon to decompose words rather than an affix-stripping algorithm. This makes it possible to incorporate linguistic analysis rather than morphological matching alone. The authors feel that "most retrieval operations like left truncation, right truncation, and masking are considerably inadequate as regards their ability to filter out inappropriate terms and expand upon more useful ones" [15]. In their discussion the authors compare their approach exclusively with truncation, saying that truncation is insufficiently powerful in recall and precision. They do not include conflation techniques in their review. As the review above has demonstrated, the incorporation of recoding rules into a conflation algorithm can substantially improve precision. The operation of MARS is described below; it should be borne in mind that any potential improvement in precision and recall should be balanced against the operational costs and computer storage required.

MARS has morpheme dictionaries and grammar rules for each language. They are used to split words into prefix, stem, derivational and inflectional elements. The extracted word stems are collected in a stem-file in which pointers back to the textwords containing the particular stem can be followed, enabling retrieval of these words. The morpheme dictionary contains affixes, inflectional endings and fillers. Each entry is stored with a 32-bit string indicating special morpheme characteristics and certain compositional properties. The morphemes in the dictionary are the longest possible strings obtainable from all of the possible derivations ("traditionality", for example, would be viewed as a derivation of "tradition" and not "trad(e)"). This morpheme dictionary is supplemented by two smaller lists. One includes "irregular" stems like Latin and Greek plurals and irregular verb forms. The other list contains strings which regularly undergo graphemic change (like "y" to "ie"); these transformations are processed automatically.

A pre-processor checks to see if string transformations are necessary. After this, the three lists are used by a decomposition grammar which deals with each word. After having reached a certain stage in a word (a prefix, for example) certain conditions have to be fulfilled if the word is to be passed to the next stage. These conditions are listed in the morpheme grammar for the language.

MARS was tested by a retrieval expert who carried out twelve real searches, once with and once without MARS. Recall was increased by 68% when MARS was used. Moreover, this was achieved without a significant decrease in precision (this did decrease, but only by 7%, from 68% to 61%). There were difficulties with compound words and phrases and with verbs; these were caused by limitations within the structure of MARS and can be offset by modifications to the search strategy used.

3.3.9 Porter

An iterative algorithm was developed by Martin Porter [2] at the University of Cambridge Computer Laboratory. He uses a concept which he calls the "measure" of a word. This is the number of vowel-consonant transitions in the word. It is used in some of the conditional rules: for example "remove terminal 'ance' if the measure is greater than one". The algorithm is a five-step, partially iterative procedure using a dictionary of around 60 suffixes. Porter notes that a point is reached in the development of a conflation algorithm when the inclusion of additional rules to improve performance in one area leads to a corresponding decrease in performance in another area. He warns that unless this tendency is guarded against it is very easy for the algorithm to become more complicated than it need be, and there is a temptation to try to deal with word-forms which appear to be important but which are rare in most applications. (He cites the examples "deceive/deception" and "resume/resumption", which occur very infrequently in the vocabulary of most indexes.) Since there will always be some error rate, Porter argues that it is not worthwhile trying to cope with these cases.

Porter's algorithm is simple. It has few rules and a small dictionary, and so is economical of computing time and of storage. It was tested and found to achieve a comparable (actually slightly better) level of effectiveness than the algorithm [16] used previously at the Cambridge Computer Laboratory (see below for discussion).

Porter's algorithm is the one used by Frakes [17] in the CATALOG retrieval system. This incorporates a default naive user mode which is based on the Paperchase system (2.4.3). An experienced user can override the defaults and use an alternative command mode. However, whereas Paperchase only included truncation, CATALOG incorporates an automatic conflation algorithm. A user can do manual truncation, but the default is to search for related terms automatically. Related forms are presented to the user, with the number of occurrences. The user can select from this list or indicate that all terms are to be used. Frakes has written that the CATALOG retrieval system has been shown to be "feasible" [17], but this conclusion is not based on an extensive test of Porter's algorithm in a real environment.

3.3.10 Dawson

Dawson's algorithm [16], referred to above, was based on that developed by Lovins [5] but extends Lovins' initial list of about 250 suffixes sixfold to about 1,200. It was anticipated that the size of this suffix list might create problems of storage and processing time. Dawson coped with this large suffix list by reversing the suffixes (and word-specific suffix removal conditions) and indexing them by length and by final letter. This algorithm used a longest match method. Unlike most of the algorithms, Dawson's does not use recoding. Instead, there are classes of stem ending, and if two stems match up to a certain number of characters and the remaining characters of each stem belong to the same stem ending class, then the two stems are conflated to the same form. Dawson includes fifty of these stem ending classes. An example of this might be with the stems "absorb" and "absorpt"; by including "-rpt" and "-rb" in the same stem ending class these two stems can be conflated to the same stem.
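Dawson's alternative to recoding, in which two stems are treated as the same when they share a leading string and their remaining characters fall in one stem ending class, can be illustrated with a short sketch. Only the single class containing "rb" and "rpt" comes from the example in the text; the function name, the minimum shared length and the search over split points are assumptions, and Dawson's fifty classes are not reproduced.

```python
# One illustrative stem ending class; Dawson's full set is not reproduced here.
STEM_ENDING_CLASSES = [{"rb", "rpt"}]


def same_stem(stem_a, stem_b, classes=STEM_ENDING_CLASSES, min_common=3):
    """Decide whether two stems should be conflated by partial matching."""
    if stem_a == stem_b:
        return True
    # Length of the leading string shared by both stems.
    common = 0
    for char_a, char_b in zip(stem_a, stem_b):
        if char_a != char_b:
            break
        common += 1
    # Try each split point that keeps at least min_common shared characters.
    for cut in range(min_common, common + 1):
        tail_a, tail_b = stem_a[cut:], stem_b[cut:]
        if any(tail_a in cls and tail_b in cls for cls in classes):
            return True
    return False


absorb_match = same_stem("absorb", "absorpt")
absolut_match = same_stem("absorb", "absolut")
```

The stems "absorb" and "absorpt" agree as far as "abso" and their remainders "rb" and "rpt" belong to the same class, so they are conflated; "absorb" and "absolut" are not, because their remainders share no class.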


3.4 Evaluating conflation algorithms

The effectiveness of stemming algorithms can be evaluated by assessing the degree to which terms are overstemmed and understemmed. One measure is the proportionate decrease in the number of distinct terms after stemming. Lennon and others tested five conflation algorithms [3] as part of an evaluative study. They confirmed a previous suggestion by Landauer and Mah [18] that the RADCOL algorithm tended to overstem (reducing "posed", "positively" and "positioning" to "pos"). The Porter algorithm tended to understem (reducing "accuracy" to "accurac" but "accurate" and "accurately" to "accur"). These conclusions are supported by the compression results which were achieved by Lennon and his colleagues. With the Brown Corpus, Porter achieved the least compression (38.8%) and RADCOL achieved the greatest (49.1%). The other algorithms tested achieved 46.5% (Lovins) and 47.5% (INSPEC). Several test databases were used and, while the percentage compression achieved did vary significantly according to database, the relative compression achieved by different algorithms was similar.

Retrieval tests demonstrated that algorithms which tended to stem generously did not necessarily increase retrieval effectiveness; the Porter algorithm tended to understem, but it performed better in the test than the RADCOL algorithm, which tended to overstem. The INSPEC algorithm, on the other hand, is also a strong algorithm, but it gave the best precision-orientated search. Lennon and his colleagues also performed a test for recall effectiveness. In this test, a similarity measure using trigrams performed well, but the Porter algorithm performed as effectively. They conclude that "... there is no relationship between the strength of an algorithm and the consequent retrieval effectiveness arising from its use".
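The compression measure mentioned above, the proportionate decrease in the number of distinct terms after stemming, is easily computed. The sketch below is illustrative only: the function name is our own, and the toy stemmer (which merely strips a final "s") stands in for whichever conflation algorithm is being assessed.

```python
def index_compression(words, stem):
    """Proportionate decrease in distinct terms after applying `stem`.

    Returns a fraction between 0 and 1; multiply by 100 for a percentage.
    """
    distinct_words = set(words)
    if not distinct_words:
        raise ValueError("at least one word is required")
    distinct_stems = {stem(word) for word in distinct_words}
    return (len(distinct_words) - len(distinct_stems)) / len(distinct_words)


def toy_stem(word):
    """Stand-in stemmer for the example: strip a final 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


compression = index_compression(["cat", "cats", "dog", "dogs", "fish"], toy_stem)
```

Five distinct words are reduced to three distinct stems, giving a compression of 0.4, or 40%. A stronger algorithm gives a higher figure on the same vocabulary, which is why the measure indicates the strength of an algorithm rather than its retrieval effectiveness.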
Altering the emphasis slightly, significance tests showed that none of the conflation algorithms tested was significantly worse, and several were significantly better, than use of unstemmed words.

3.5 Stemming in online catalogues

As mentioned in 2.4.1, we do not know of any catalogue accessing a general collection which uses automatic stemming. Among specialised or experimental catalogues, there is CITE, which uses a stemming procedure designed for medical terminology [19]. For the intermediate version of Okapi we used a slightly modified version of Porter's algorithm [2]. This system was not put out for live use, but experiments involving the repetition of real searches from transaction logs showed that even this allegedly "understemming" procedure could cause a serious loss of precision. The often-cited example of "communism" and "communication" becoming conflated is a good enough reason for rejecting the unconditional use of even a comparatively weak procedure.

The dangers of uninhibited stemming in online catalogues seem to arise from two causes: the general coverage of the typical academic or public library database, and the lack of specificity of many of the searches. Further, many users do not want an exhaustive search. They want to find one or two relevant items, and are not prepared to look at dozens of irrelevant records before they find them. Of the sample searches in the table, strong stemming would adversely affect at least some (such as "radio" and "modernism"). On the other hand, some degree of stemming would benefit at least six of these searches.

3.5.1 Choice of stemming procedure for online catalogues

Choice will be determined by the need to apply varying degrees of stemming, and by libraries' generally limited computational resources. Of the methods described in this chapter, MARS (3.3.8) looks the most ambitious. It also looks difficult to implement and computationally demanding. An iterative procedure is attractive because it should need a smaller dictionary (long suffixes are treated as a sequence of shorter ones to be successively removed). It would usually be possible to partition both iterative and longest match procedures into two or more stages.

It is worth noting that the degree of compression which stemming produces is irrelevant. Even if stemming reduces the number of entries in an index by half, the total storage requirement is almost unaffected. (Nearly all of the word index to a large file is made up of postings, or pointers to the records which are indexed by the words.)

We chose Porter's algorithm for the intermediate Okapi because it is short, simple, easy to program and readily available. It takes about four kilobytes of code and data in Z80 assembly language. We can see no reason to alter our choice for this project, and merely split the procedure into two stages, the first performing "weak" stemming and the second performing "strong" stemming. We have incorporated some "spelling standardization" into the "weak" stage. This is described in a later chapter.


References
1 MATTHEWS J R. Public Access to Online Catalogs: a planning guide for managers. 2nd ed. Online Inc, 1986.

2 PORTER M F. An algorithm for suffix stripping. Program 14 (3), 1980, 130-137.

3 LENNON M and others. An evaluation of some conflation algorithms for information retrieval. Journal of Information Science 3, 1981, 177-183.

4 OVERHAGE C F J and REINTJES J F. Project Intrex: a general review. Information Storage and Retrieval 10 (5/6), May/June 1974, 157-188.

5 LOVINS J B. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics 11, 1968, 22-31.

6 LOWE T C, ROBERTS D C and KURTZ P. Additional Text Processing for On-line Retrieval (The RADCOL System). (Technical report RADC-TR-73-337). 1973.

7 TARRY B D. Automatic Suffix Generation and Word Segmentation for Information Retrieval. M.Sc. thesis, University of Sheffield, 1978.

8 FIELD B J. Semi-automatic Development of Thesauri using Free-language Vocabulary Analysis (Part 1 only). (Report no. R75/24). Inspec, 1975.

9 SALTON G. Automatic Information Organization and Retrieval. McGraw-Hill, 1968.

10 DATTOLA R T. FIRST: Flexible Information Retrieval System for Text. Journal of the American Society for Information Science 30 (1), January 1979, 9-14.

11 JONES K P and BELL C L M. The automatic extraction of words from texts especially for input into information retrieval systems based on inverted files. In: Research and Development in Information Retrieval: proceedings of the third joint BCS and ACM symposium, King's College, Cambridge, 2-6 July 1984. Edited by C J van Rijsbergen. Cambridge University Press on behalf of the British Computer Society, 1985, 409-419.

12 BELL C L M and JONES K P. A minicomputer retrieval system with automatic root finding and roling facilities. Program 10 (1), Jan 1976, 14-27.

13 CERCONE N. A heuristic morphological analyser for natural language understanding programs. The IEEE Computer Society's First International Computer Software and Applications Conference, Chicago, Illinois, 8-11 November 1977. New York: IEEE, 1977, 676-682.

14 CERCONE N. Morphological analysis and lexicon design for natural-language processing. Computers and the Humanities 11, 1978, 235-258.

15 NIEDERMAIR G Th, THURMAIR G and BUTTEL I. MARS: a retrieval tool on the basis of morphological analysis. In: Research and Development in Information Retrieval: proceedings of the third joint BCS and ACM symposium, King's College, Cambridge, 2-6 July 1984. Edited by C J van Rijsbergen. Cambridge University Press on behalf of the British Computer Society, 1985.

16 DAWSON J L. Suffix removal and word conflation. ALLC Bulletin, Michaelmas 1974, 33-46.

17 FRAKES W B. Term conflation for information retrieval. In: Research and Development in Information Retrieval: proceedings of the third joint BCS and ACM symposium, King's College, Cambridge, 2-6 July 1984. Edited by C J van Rijsbergen. Cambridge University Press on behalf of the British Computer Society, 1985, 383-389.

18 LANDAUER C and MAH C. Message extraction through estimation of relevance. In: Research and Development in Information Retrieval: proceedings of the ACM-BCS symposium, Cambridge, 23-26 June 1980. Cambridge University Press, 1980.

19 ULMSCHNEIDER J E and DOSZKOCS T E. A practical stemming algorithm for online search assistance. Online Review 7 (4), August 1983, 301-315.