E5

F'l-izzv' m a t c h i n g a n d s p e l l i n g c o r r e c t i o n
Introduction

5.1

Information science has traditionalLy been concerned with methods of storing and accessing words so that classes of orthographically similar words can be retrieved. Rs Long ago as 1961 Bourne and Ford published a review called - R study of methods for systematically abbreviating English words and names" [13. Since then many other techniques have been described and tested. These techniques were reviewed by Hall and Dowling in 1980 [23. We describe these techniques as word representation devices as they all share a similar function: to represent a word in such a way as to facilitate the retrieval of orthographically similar words. These devices can be used for the retrieval of misspelt words, but have also been successfully used to broaden retrieval. 5.2 N-grems 5.2.7 Definition and applications

Rn^n-gram is a substring of a word, where n is the number of characters in the substring. Digrams, trigrams and tetragrams have been used. The assumption is that words which have a high proportion of n-grams in common will be similar. Raising the threshold for the proportion of ngrams in common increases precision but decreases recall, and vice versa. The length of the n-gram strings which are used will also influence recall and precision; the longer the string, the smaller the number of words which will contain that string. Rt the other extreme is the •1-gram" C"monogram"?D. N-gram representation has two applications. First, it can be used for spelling error detection. The word "sociology" produces this set of tri-grams: "-so", "soc", "oci", "cio", •lot"j "olo", "log", "ogy", and "gy-". Its misspelling "socialogy" produces "-so", "soc", "oci", "cia", "ial", "alo", "log", "ogy", and "gy-". 5ince six out of nine trigrams are identical, one could assume that the two words are related as the orthographicaL similarity is high.

-41-

5 Fu^^y m<§)t(gb£{n)g ©nd ©p©tling ©©pp©©ti©n
pn©b i t c©n b© yggd t© d@td>et ©?©pd© ©blind i r t (nnopph©(L®p£c©Llv ©iiPiiitL©p ire ib© hop© thmt ih©y fcd.ll el©© b © mmmmr « t£cs®L(Lv p©l©t©do H tpigp©©) r©p(p©g©(ntioti©n ©f th© ^©rd D y ® g n i t i © n ° 0 f o r ©^©fppl©© pf®dyg@§ ib© tpipp©©}©? D~©©% % afei©% ° i © n ° 0 0 © n ° a o 0 a a a g n i t i v © " pp®dy©©© °©© P c © g ^ a @gn ra 0 Q gni% ° n i t % ! [ ] ( ; Rg©in5 linn© %in ©yt ©f ©nd ©Vd ii© iti n@ tpigpgpp© ©P© ©©mfp®n t© ib© t©?© fe©P)Pdf @n© © © y l d § § § U K i ) © o tb©'b ib©> b © ©®nd© ©p© p©l©t©d 0 © ) Th©p© ©p©y h o ^ d v g r fl ©©pt©in bi©©d©©nt©g©© t® fh© y©© ©f o* i= gp©Mi©o Sfepd© ©t©P©b ©P© @pib®gp©phi©©(l[ly ^©^v ©i©©i(L©p c©n b© g©©t©nii©©!tiLy di©©i©bh©p 0 F^©y©d! ©nd WJitltl©tt E3J ©it© th© ©n©nplL@ ©f ° p y n n i n g ° ©nd °©ynning° C1803 o .1© ®ib©p in©i©nc©§ 0 M©pd© mmy b© p©tpi©©©d5 ^hi©b© ihp©ugh th© © d d i t i ® n @ ©tfi^©©©0 ©P© @pp@©it© t© t b d qy©py i©p©\) Cf@p f ©^©mpl© °©pp)®©p ° ©nd adi©©pp©©p a 3 0 H y©©fyl,©ypy©y ©f pp©vi©y@ n-gp©n ©©p©pipi©nt© i© pp©vid©d bv 2©n®p©0 P©LL©©k ©nd 2©n©p© 142 0 Jhm f i r s t i n s t © n c © ©f o ° g r § ^ ©n©ly©i© pp®p©p ^ © p u b l i s h e d by Rdaroson ©nd B@r©ham ©© CS3 g ib©y b<D©©d ib©©p ^©rlk @ ib© ©s©yprpii@n t h © t a t b © © ©h©p©©t©p ©tpyntyp© ©f © ©@pk i s © p©t©t©d t© i t © ©©manti© © ©©©t©nt _ p© t© n©k© t h i s © y©©fyL b©si© f®p ©yt©m©ti© ©1©©°= © i f i © © t i b n ©f s o r d i d lnt©p~H®pd s i m i l a r i t y ©©©ffi©i©nts ^©r© y©©d © © b © s i s f©p © ©fLy©t©ring ©f ©?©pds^ tb©y f o y n d © t b © t i b i s ( © t© i © t y i t i v © L y p©©s®nabl© g p o y p s ©f w@pdic R L d ©©© L g r o u p ©f (©©tb©n©tic©t i i t l © s ©?©r© ©Lty©t©r©d by t h e i r [ ) L ©©n©tity©nt digr°©©i©o Clhi© pp©y©dyp© ^ © t©st©d by W i l l © t t © s ESJ? b© b©ynd t b © t i t g©v© p©@r p©muLt© ©dth th© Cp©©fi©Ld i ^©t ©©LL©©ti@©0 3 Bo2oS b© ©d ©-gp©i© t©©b©igy©g t© ©©g©©©© p©tpi©©©i ' @

b©©n©n ©nd ©rbb©©© 1173 ©p©©t©d © di©ti©©©ry fr©m tb© t©rm© i© th© t i t t © © ©t th© & © n t i © l d t©©t ©©LL©cti©n; th©s© t t r a s ^©(p© tb©n (p©pp©©©©t©d b © Li©t© ©f c © n © t i t y © n t di=gp©m© ©r s trigp©[©© 0 E©)©b ©f th© t©pm© i n th© 2 2 S ©Kperim©nt©L gy©pi©© ©^©P© tb©n ©i[©ilL©pLy irmp^^^mntmd by d i = gr©mj© ©nd tpigp©[©© mnd mmidhmd ©g©in©t ©©ch ©f th© <©opd© i n th© ind©,©© Xndmu t©P(©© ©?ith m § i n i t © p i t y ©©©ffi©i©nt gp©©t©r
m ©@©J© tbr©©h@ld ©©Ly© ^©©P© ©@n©id©p©d t@ b© v©pi©nt© ©f

th© quim^y <L^\rm ©nd ©yt@©i)©ti©©L{Ly ©dd©d t© th© q]U(i>pyg th© ©Kp©nd©d gy©pi©© © © © th©n yg©d f©p ©©©p©hing th© f i t © ©f ^ p d©©ypp©pit© i n th© m@p©j©il ©^©y0 FI©tpi©v©L ©Kp©pifp©nt© g©v© © i©v@IL ©f p©tpi©v©L ©ff©©tiv©n©©© ©tiich wmm ©t (L©©©t c©©)p©p©b(L© ©?ith th© tL©v©L© ©bt©in©d fp©(© © p©ng© ©f ©©n©©nti®n©L ©t©of©i]ing ©Lg©pith©©o Fp©ynd ©nd i d i l L © t t [[33 ©d©pt©d th© inv©pt©d t i t © ©tpyctyp© d©©©pib©d by dfap©©yli mnd ©th©p© [ 8 3 ^©hi©h d©m©n©tp©ted th©t ©y©h © ©tpyntyp© ©©uLd b© y©©d t o p th© © a l © u l © t i o n of © i n d L © p i t y ©©-=©tfi©i©nt© ©nd f©r th© p p o d u c t i o n ©f p©nked output e Fp©ynd ©nd b?il!L©tt ym©d © d i c t i o n a r y of 12,000 t©pn©^ © © ^ntry © h i n th© inv©rt©d fit© c o n s i s t e d ©t i n n-

5 Fuzzy

matching

and

speLLing

correction

gram and a pointer to a List which contained the term numbers for each occurrence of the n-gram in the inverted file. The procedure separates the search word into its ngrams, identifies the appropriate Lists in the inverted file and, finally, "adds - or merges the lists in order to identify the number of n-grams common to the query term and to each of the words in the file. Freund and tt/illett used the Dice similarity coefficient to compare the similarity of the search terms with the index terms. These index terms were then presented to the user for possible use as query expanders. In their tests, even with the lowest threshold, there were usually less than 20 words for the user to choose from. Freund and Willett accept that this would be unwieldy with a larger file such as a library catalogue 13, p1B33. 5.2.3 Use of n-gram techniques to detect speLLing errors

Several different methods have been used. N-6RHM FREQUENCY TABLES Morris and Cherry C9, 103 extracted digrams and trigrams from text words and used them to create frequency tables. The text words were then checked against a small dictionary of common words collected from about one million words of technical text. The statistics were used to calculate an index of "peculiarity1' for the unmatched words and used to rank the unmatched words on the basis that those most likely to be misspelt would appear at the start of the list. Cornew [11] also used digram frequency tables to convert an unknown text word to the dictionary word it most closely matches. The new word is then looked up in the dictionary and the process repeated until a valid word is found. R similar method has been used by UlLmann L123. He used ngrams to convert each unknown word to the most similar dictionary word; this method can find all dictionary words that differ from a given word by up to two errors. On ngram method is also given for correcting up to two substitution, insertion, omission and transposition errors without doing a separate computation for every possible pair of errors. Its application is limited, however, as it is described only for six-letter words and it is dependent on the use of special hardware. PERMISSIBLE SYLLABLE 5EQUENCE COMPARISON Nussbaum and Schek [133 used automatically generated tables for error detection which describe permissible syllables and syllable sequences based on clusters of acceptable initial and terminal letters.

-43-

5 Fuzzy

matching

and spelling

correction

INVPLID TRIGRRM DETECTION Much useful work has been done as part of the 5PEEDCDP project on spelling error detection and correction [4, 14, 15, 163. The method of trigram spelling correction used by the 5PEEDC0P team differed from previous work in that it used direct measures of the trigram error probabilities rather than relying on the frequency of of trigrams. The rationale behind the 5PEEDC0P experiment was that miskeyings and misspellings would contain invalid trigrams. R study C4] was designed to determine if there is sufficient difference between the trigram compositions of correct and miskeyed words for keying errors to be detected. The motivation was the? fact that if word boundaries are included there ar& 18,954 possible trigrams using the English alphabet, but only a small proportion of these trigrams actually occur in text. Hence it is a reasonable assumption that many misspellings wiLl contain invalid trigrams. In fact many misspellings do not contain "invalid" trigram,- - the transposed miskeying "dictoinary" for exampleExperiments revealed, moreover, that the method gave inadequate recall and precision whatever the threshold chosen. The authors suggest that it might be better to use syllabic n-grams rather than trigrams. Using positional and cooccurrence information about trigrams could also improve precision and recall. This would have the advantage that it may be possible to determine the position of an error in a misspelling. P word is assumed to be misspelt if it contains two consecutive trigrams with error probabilities greater than some threshold. This method determines the error location accurately to within one character in 9 4 % of instances, although it cannot distinguish accurately between different error types. POSITIONAL N-GROM 0NRLY5I5 Riseman and Hanson examined the effectiveness of various methods of using contextual information to detect and correct keyboard errors C173. They used positional "binary" n-grams to detect miskeyed words, establish the position of the error and in some cases to determine the character which can correct the word. Riseman and Hanson contend that while positional n-grams require more storage, the process of collecting and storing contextual information is simplified and computational complexity is reduced. Moreover, positional information is more compact than probability information about n-grams.

5 Fuzzy

matching

and spelling

correction

Carlson [183 used positional trigram probabilities to correct errors in English first names and fix the position of the error; an error correction rate of 9 5 % was achieved. 5-2.4 Different The value of n techniques. values of n

has a strong influence on all n-gram

If n-grams are to be used in a spelling error detection system then a high value of n is more likely to make erroneous spelling produce "peculiar" n-grams. This is certainly not always the case C"socialogy" generates acceptable n-grams for any value of n } . But if the object is to find as many words as possible of which this may be a misspelling, then n should be one or two. The problem is different if n-grams are being used to find words which are morphologically similar. In this case, the value of n should presumably be rather close to the average length of syllables: i.e. n should be two, three or four. In practice, computing storage and processing requirements are an important factor. Substantial storage is needed if an additional index of "long" n-grams is to made from an initial inverted index. Several experimenters have used trigrams because they represent a compromise between digrams which are often inadequately "strong" and tetragrams, of which there are a great many. 5.2.5 Effectiveness tests

Lennon and others 17] evaluated the effectiveness of a similar technique to that used by Hdamson and Boreham LSI. Index terms with a similarity coefficient greater than a threshold value were considered to be variants of the query term and automatically added to the query; the expanded queries were then used for searching the file of documents in the normal way. Retrieval experiments demonstrated that this procedure gave a retrieval effectiveness which was at least comparable with that obtained with a range of conventional stemming algorithms. N-gram measures seem usually to have been used in experiments with rather small files. The method used by Lennon and others was reliant on the matching of the search term with every term in the file. The technique used by Freund and Willett C33 performed with reasonable accuracy on their 12,000-word dictionary but it would probably need to be modified if it were to be used in a much larger library catalogue. Freund and Willett point out that the use of trigrams can lead to an unacceptably large number of non-related items, especially if a Low similarity threshold is used. They feel however that "the

-45-

5 Fuzzy

matching

and spelling

correction

numbers of indexed terms retrieved using trigrams is quite acceptable for rapid visual inspection at a terminal" [3, p1823. Using digrams rather than trigrams sometimes retrieved a ^ery large number of words. Experiments conducted with the 5PEEDL0P system demonstrated that the invalid trigram method gave inadequate recall and precision whatever threshold is chosen. The authors suggest that the adoption of a more sophisticated error detection measure which might use syllabic n-grams rather than trigrams or use positional and co-occurrence information about trigrams could improve precision and recall. Trigram analysis has the advantage over the use of a dictionary in that it is sometimes possible to determine the position of an error in a word. This is inherent in the method used since a word is defined as misspelt if it contains two consecutive trigrams with error probabilities greater than the threshold selected. The trigram analysis method determines the error location accurately to within one character in 3A% of instances, although it cannot distinguish accurately between different error types. Riseman and Hanson [17] compared a binary positional trigram correction procedure with a dictionary lookup procedure. They used a fairly large set of six letterwords. The positional trigram method was not quite so effective in correcting errors as the use of a complete dictionary, but the detection and correction rates were high enough that the difference was marginal. They conclude that, if the dictionary is fairly Large, the trigram method is computationally faster and occupies less storage. However, they assume that the entire dictionary has to be searched. This is only the case if no assumptions are made about the type and nature of the errors. In practice, dictionaries are stored in such a way that comparatively small lists of candidate corrections for most erroneous words can be found rather quickly. 5.3 5oundex, soundex-type and other abbreviation codes 5.3.1 Definition and applications

Soundex was patented as a clerical technique for the manual coding of names. It was designed to help in the retrieval of misheard or misspelt names. There have been many modifications of the original 5oundex procedure for different applications, and several programs have been published [19, 20, 213. The name has come to be used for a wide variety of word representation techniques. CWhen used generically we write it as "soundex - by analogy with "hoover"3. Unlike n-grams, which represent a word by a set of character strings Cand so greatly expand the original number of characters!), soundex-type codes represent a word by its most significant characters Cand so reduce the original number of characters}. The original Soundex

5 Fuzzy

matching

and speLLing

correction

represented names phonetically. It retained the initial Letter, removed vowels and a few other letters, replaced consonants by codes for phonetically related groups, removed repeated codes, and finally truncated the name at four characters. Many, but not all, soundex-type procedures are also phonetically based. The most appropriate application for a soundex type representation code is in finding candidate replacements for misspelt and miskeyed words. 5.3.2 Use of Soundex-type codes: a survey

Tests conducted by the Dominion Bureau of Statistics of Canada [22, 233 demonstrated that while 5oundex compared favourably to other codes it did not perform adequately with non-Western names. 5oundex was designed to retrieve words after errors in hearing rather than keying and spelling errors. Even so, it is easy to find names which present retrieval problems - •Rogers" and "Rodgers" for example. Fenichel and Barnett C243, quoting Olberga C253, point out that there is some evidence that written spelling errors are often misconstructed from the correct forms by the same errors as phonetic errors of hearing. Davidson [263 used a soundex-type algorithm to encode the names of airline passengers in order to cope with misheard names. The Davidson code is not phonetic: it was felt that the international scope of the names to be included would make the phonetic equivalences of certain letters difficult to standardise. Opart from this it is almost identical to Soundex, except that there is a fifth character which contains the initial of the first forename, if known. Blair [27] tried a soundex-type coding scheme which aimed to retain the differing amounts of information associated with different relative positions in the word and with different letter frequencies. The highest weight is given to the first letter, followed by the last Letter, the second letter, the next to the Last Letter and so on. Each Letter was scored by combining its positional score with its letter-frequency score. The "Least important" Letters were then deleted until the required code length was reached. Blair's code correctLy identified 69 out of 117 misspelt words and incorrectly identified two. Errors arose either because the word was not in the original vocabulary or because the misspelling was so extreme that it gave rise to a different abbreviation. Blair suggests correcting the first type of error by adding it to the vocabulary when it is updated. The second type of error is corrected by creating a special index in which the correct spelling of a problem word, and its abbreviation, are identified by a link.

-47-

FUJ^^Y

/ © © f c b i o g mtnd § p © l i i © g

©©p©©©4i©©

0©©©p©y d © 4 © © h © i g y © f © p 4b© ©©x©py4©p d © 4 © © t i © © ©©d ©®pp©©ii@© @f © p e r i l l i n g ) © P P ® P © i § d © © © p i b © d i o E2S3 0 His ©i©4b®b ©@©uinfi)©© 4b©4 i ] ©©pd ©4pi©b e©fn)(n)®4 b© f © y © d i n ® d i © 4 i © n © p y b©© ©4 ©)®§4 ©©)© ©pp©pg i©©p©©ii®Po o f b©Y/b®©pd ©pp®p@ £ 0 © p © 4 p i © © © l gygtgwa ©b®t©©d i b © 4 © V © P ©0% © f t p r @ p i w g p § e©y@®dl b y © ^©©(©(gb d i g g i n g © P ©^©4P© t © 4 4 © p ©p © © i o g t © 4p©©©p®©iii©©o F i © pp@©©dyp© ©?©pk© b y ©©©©©ding 4b©4 § n y ®o© @f 4b©©© © P P ® P @ ©FLgbb b©©@ @©©ypp©dc 14 P © © © P © © © ©L(L p@©©£b!l@ ©pp©p© © f 4b©©@ 4vp©@ d^p y © £ b © © 4 £ f £ © d ©®pd© u n t i l © d i © 4 £ ® o © p y ^ © t © b £© ®p £© ©©4 f ® u n d o T©©4© © d 4 b © ©@(l(L©©4i®n ©if ©©©un®n §p©SLH£ng © P P ® P © € © f 4b© f ® y p 4vp©@ ©b@©©3 (op©© © §©©©©©© p©4© © f @©©p 1)5% o D©I©©P©U ©©x©p©p©b b£© 4©©bn£qy© 4© 4b©4 © f H(L©£p © n d 0 f ® p 4b© ©©rp©© © f ©©©©©©© (©£©©p©L(L£©g)© o g£(M)£(L©p p©©y(L4@ ©?©P© @b4©in©do f ! b © i p 5 © 4©©bnig)y© 0 p©pb©p© ©@b § y p p p £ © £ n g f l V o e©y!Ld n®4 d © © l p©L4b ©t©©b£n©=g©pbl©d 4 © ^ 4 a Tb© © © n p y 4 © 4 £ ® n © l ©@©4 £© n®4 p i © © © b u t i © H i t e i v 4© b© © u b © 4 © n 4 £ © l g i v © n 4b© © ^ 4 © n © i v © ! 1 © 4 4 © P ®©©p©pi©©n© ©?bi©b b©v© 4® b© ©}©d©c Ffaypn© © n d F © r d E 1! 3 ©©M©o©p£©©d d i f f © p © n 4 ©x©4b©d© f ® p © b b p © v i © 4 i n g ©?@pd© ©y©>4©cn©4i©©lL^o f b © y 4©©4©b 4b© p©pf©ppD©n©© @f 4 b i p f © © n b©©£© 4 © © b n i g y © © i © i b © p © 4 p i © v © l © f 4©©bni©©(L d®©un©n4© f p © ^ © © © I L l © ® 4 i © n ©4 4b© S 4 © n f @ p d F©n©©p©b 2 n © 4 £ 4 u 4 © o T h i © i © © 4 i l t © u @ © f u l ©yp©©y © f © b b p © © i © i i @ n 4©©hiniqy©© D b u t 4 b © y d i d n@4 ©@n©id©p 4b© © f f © © 4 i v © n © © © © f 4b© 4 © © b n i g y © © © i t b © p f © p ©c©)t©bing © ) i © © p © l 4 i n g © @P 4 ® P i n © p © © © i n g p©y©(L(t 0 F O P © © t n b i n g ©)i©©p©L4 p©p©®n©l nmmtmm 0 S p © © n f i © L d [[293 ©©iMip©p©d f b © S©y©d©^ ©©d©p f b © D©vid©©© c©d© a ©^©©b nmmm ©©,©p©hi©g © n d © ©©©p©b ©o ©©©©nd©pv © b © p © © t © p £ ^ t i © © ©y©b @s d © t © © f b i p f b ^ § © K ©p ©g©o Tb© f©©4 ©&©pL©v©d wmn b b © t © f f i n d i n g dyp(Li©©b© p©y©rd© i © © d©t©b©©©o O f t b © ©©d©©0 D©vid©©© g©v© m © L i g b t L © b © f f © p b i t 4© Pn]i©©'©f©h p©4£© 4b©o ©©©© 4b© ©^©©4 ^u^nmmm 4 © © b n i p y © d i d ; si g©v© 2 o 3 4i©j©© f©\©©p o©i©{©©4©b©© 4 b © n d i d S©ynd©K 0 I i_nd©K d i d pp©dy©© © © L i g b t L v b i g b d P \numb(wr © f iru& PD©4©b©© i b © n D©vid©@© b y i 4b© d i f f © p © n © © ^©© o © g L i g i b L © 0 B p © © o f i © ( L d ©@©©flyd©© i b © 4 ° 0 0 o 0 © v i d © © © / ? © i © 4b© 4 © c b © i g y d ©f ©b©ic©c m )2 0 b y t b© o©©k©© © y g g © © 4 i © n © f ©P i © p p @ © i © g £4© p©pf©p©©i©©©o F©p ©^©©p(L© 0 D©vid©@© (©i©©©d f y y p f © © . © i©)©b©b©© ©©pp©©4(Lv; ^®d© b© S@y©d©^g © f 4b©©© D Etovid©©© [©i©©©d © i g b f b^^miu^m 4b© 1 © 4 4 © P © °m° ©od °©° ©?©p(i ©,©i [©©pg©do Tb©©© p©©yL4© d© o©4 ©©©fip[© p©©©p©©4i©ni© ©^pp©©©©d b© (nJ©©p© O 0 3 4b©4 0 © v i d © © © ij©©y(Ld pp©dy©© ©)©p© f © l © © o©g©4i©©© 4b©© S © y o d © ^ 0 B p © © © f i © L d d©©© ©@4 4©©4 © © ^ © d i f i © d D © v i d © © n ©©d©D

~m~

5 Fuzzy

matching

and speLLing

correction

5.4 Fuzzy matching in online catalogues 5.4.1 Spelling correction

Reasons for providing a method to deal with miskeyings and misspellings were given in 2.4.3. This section discusses the methods which can be used and how they should be offered. One problem in online catalogues is when to assume that a word or phrase is misspelt. If the lookup procedure is able to determine that there is a unique key which is a near match with the user's key then there may be no need to perform any correction. This would apply particularly to phrase-matching systems. Where there is a single subject heading or name or title which matches on all except the last few characters of the user's key, this should be offered as a match Cwith perhaps an unobtrusive message to the effect that "this doesn't exactly match your search"}. We do not know of any catalogues which can do this. It can be rather demanding on computing resources, involving stripping off final characters and shuffling around in the index. Even with keyword access, if a title or subject word is not found but the proportion of words which have index matches is high enough, then the result of a successful OND on the words which are found has a reasonable chance of being the sought item. The original Okapi system would do this. Better precision may be obtained by using an inverse word a frequency weighting Cgiving a higher weight to rareu words). In the SWRLCRP LIBERT05 system, a "notional" weight is assigned to words which are not found in the appropriate index, and a search missing a word can still succeed, although the user will be warned that the itemCs3 found do not match the search exactly. Spelling correction in online catalogues should also take into account the fact that spelling errors vary according to the type of search. Transaction Log analysis of Okapi '84 [31] showed a marked difference between errors in specific item searches and subject searches. In specific item searches users are often copying from a printed source. In particular, there are rather few errors in personal names, and they are more likely to be phonetic or speLLing mistakes than keying mistakes. In subject searching miskeyings predominate, and they are frequent. The simplest method of dealing with search terms which are not found is to report a failed search, Leaving it to the user to re-enter the search appropriately. Many users do not notice that they have made a mistake and often leave the catalogue assuming that the sought item or subject is not in it. Hence this option, or "non-option", which is what most current keyword-type catalogues provide is

-49-

5 Fuzzy insupportable.

matching

and speLLing

correction

The next Level is to provide a specific message about each word which was not found. Users still have to re-enter their search, but at least they know why it failed. 0 few current catalogues do this. If a suitable fuzzy matching procedure is available the same result can be reported to the user together with an option of looking for •similar - words Cor names). This was tried, although not on V e a l users, in an intermediate version of Okapi. 5.4.2 SpeLLing correction using n-gr&ms

The work of Freund and Willett C3] has been discussed in 5.2.5 and 5.2.6 above. It was primarily designed to improve recall by using n-grams to retrieve variations of root forms. N-gram representation has been used, however, in the detection and correction of spelling errors, notably in the 5PEEDC0P project. This project has considered both the use of n-gram analysis and dictionary look-up for spelling correction purposes. The use of dictionaries will be considered later as this is technically different from the use of algorithms. None of the experiments which have been described have been applied to library catalogues. The 5PEEDC0P experiments were used in the batch editing of chemical information. There are at least two important ways in which spelling correction in an online catalogue differs from the 5PEEDC0P environment. First, correction has to take place in real time with Limited computing resources. N-gram techniques which need a Large amount of computer storage space and processing time are unlikely to be adaptable for use in an online system. Secondly, it is possible that this technique is particularly well suited to a "hard Language" scientific discipline where the Language is generally well structured and unambiguous; n-gram analysis of the language of other fields of knowledge may might be considerably less profitable. It should be noted that even with the advantages of working with hard language information, and being able to disregard the considerable problems of designing a suitable interaction, the method gave inadequate recall and precision whatever the value of the matching threshold. The authors' suggest that the adoption of syllabic n-grams rather than trigrams, or the use of positional and co-occurrence information about trigrams, would improve precision and recall. However, the latter would probably increase storage requirements. To summarise, although n-grams have been used experimentally, it is not at all certain that they are suitable

-50-

5 Fuzzy

matching

and speLLing

correction library.

•for use in the online catalogue of a general 5-4.3 Soundex-type codes In online catalogues

Unlike n-gram techniques, soundex-type devices have been used in experimental online catalogues. The Bibliographic Recess and Control 5ystem CBRC5) developed at the Washington University School of Medicine Library includes a facility which will look for approximate spellings in the author, title, subject and series fields. If no records are found from an implied Boolean RND, then the system automatically looks for approximate spellings. It uses a simple soundex-type algorithm which drops trailing "s", doubled consonants and vowels other than initiaI voweIs [ 32 3 . This procedure is automatic and displays a helpful message to the user "frying approximation search" before showing records which contain approximately matching words. BRC5 is notable for its simplicity and its cordiality, making few demands on the user. No tests on its effectiveness appear to have been conducted. BRC5 is used in a medical Library. Medical Language may be particularly appropriate for soundex-type processing as it is regular in word construction and unambiguous in application. Problems of spelling correction will be more complicated in a general academic library serving a much wider population and covering a wider range of subjects areas. The retrieval system which has been developed at Massachusetts General Hospital 124] also attempts to correct spelling and typing errors. If no match is made even after the user's term has been stemmed then a two phase process is used. This process first identifies similarly spelt words in the database, and then interacts with the user in order to see if any of these words were the intended word. It uses a soundex-type scheme which deletes vowels, "singles" repeated consonants and conflates similar sounding consonants to a canonical representative of that class. This is supplemented by a dictionary of common misspellings, alternative spellings, and non-preferred terms Cwhich are stored but not displayed to the users to avoid encouraging their u s e ) . The dictionary also includes obscenities which are stored as terms so that they can be ignored and not printed on the screen. Even after approximate matching, the process does not demand an exact match; a match is achieved if the first few characters are identical and the length matches to within a margin of 20%. The user may specify that the search be restricted to certain indexes Cit can distinguish, for example, between the names of drugs, anatomical terms, laboratory test, and therapies). The process finds approximate matches for about half of the searches which fail to

-51-

Py©©y m&tgb&mxgj ©od ©p©UIi©(p ©©pp©cti©n find ©n : t or ©t©©3 i§teho .

Th© ©ytb@p© r©p®rt th@t in tn©©t in©t©n©©© ©nly ©ni© t©p© i© f®ynd but @(©c<g@i@(n}©(lly MP 4© b<©©L©© n©t©b©© ©p© n©b©o Mn&n ibi© b©pp©n§ 0 th© yg@r i© ©@ted ybi©b i©p©j £if ©my]) i© eop^dcto 0b©yi 8 0 % ©f th© §ygg©©i©d g@©pchd§ ©p@ ©(©edpted by th© y§dP§ ©Itboygb th© Logging ppoy^dyr© ©©nn®t @©ti(ffi)<Dt© ©h©i pr@p©rii©n of th©©© §yee©i§©§ ©©pp©§p®od 4© ib© y§§p§ 5 ®pigin©L ini©nt ©nd ©h©i pp@p@pti©n ©p© ©®©©pt©b ©© ©lt©pn©ii©© ©od ©)®©©l i©r©)§o P p©©i©©? ®f y©© ® © © P f©yp c©©©th© d©©x©n©ip©i©b ib©t D1S ®f g@©r©h©§ f@yod ©©tcb©© ©f ©©(©© sort ©i @©© ©f th© th©©© ©i©g©© 0 Th© ©yib®p© d© ©@t ©i©i© ©h©th©p flibp©py ©b©ff ® P y©©p© ^m'gimtd © ©y©©©©© p©t© of ©b@yt b©tt ©© (b©i©g ©@ti©f©©t@py 5 ©@p © p©p©©ni©g© figyp© f@p ©>©©p©b©© ©hi©b l©c©t© ©)®p© th©© ©©(© ©)©t©bo D©ib ©f th©©© t©©i©r§ ©©(LI b© inp@pi©oi f©r th© @©©p©l(l ©y©©©©© ©f th© f©©i!litv© Th© P©p©p©b©©© ©y©t©[© mt D©th I©p©©l b©©pit L in B@©• t o n ©I©© tpi©© t© b© t©b©p©nt ©f ©p©tting © P P ® P 0 On© ©f tb<s P©p©p©b©©© d©©ign ©bj©©tiv©© ©©)© t@ ©11®^© t © y§gr t a o o t o p ©b©t©y©p ©©©©s ©©tyr©Li tbi© wmm b©@®d @n th© pp©p®nidl©r aiineg of ©bbp©©i©ii@n© ©©d ©yr©©©^' ) in o©©di©©l tit©p©4yr©o Th© ©y©t©ffi ©ti©npi© t© ©)©t©h ©hi©t i© typ©b t© it©o©© in t tm d©t©b©©© by ©pplying ©©th©b© ^hbicb © p©r©@© ounigbt y©.9. l h « ©©in t©©h©ipy© i© t© (Look for p©rti©l ©©i©b©© @© ^©®P d s j thy© HEW EN JOUR HEO ^©©Ld p©tpi©v© bEffl EMSLKWD JOURI ;MHL OF (MEDICINE ©ft©r ©tt©pnpting thi© ©©tnbing t©©hnigy©o :f th© truncation ©f ©/®pd© f©il© tb©n th© order of M®rd© i ) mado fy©©i©p ©©d indi©idy©l ©©nd© i© ©©©©©©d php©©©© ©r© p s t a t s - d C333o H diff©©©nt ©nd (©©p© ©©phi©tic©t©d ©ppro©©h i© t©(k©n by th© LEXICON i©f©ro©©ti©© P©tpi©©©t ©y^t©^ E34J 0 It y§§§ © t©?©~ ©t©p pp©y©dyr© ©@©©i©ti©g ©f © (modified S©y©d©K w>v^^^zn C©^ith © p©L©tiv©Ly L©©? thp©©h©LdJ f©LL©i©©d by © high thr©©h©ld ©y©t©© 0 Thi© ©h©yld ©yt©[©©ti©©L Ly ©orr©©t BO-70% ®f th© i?pror© ©o©©y©t©r©d IL3S3) „ Thi© ©@rr©cti©n (p©t© is ©lightLy b©tt©p th©n th©t ©yhi©v©d ©t M©©©©©hy©©tt© G©n©r©L h©©pit©L £ffi©b©yt b © t f 0 3 o ( i ©Lt©r©©ti©© ©ppLi©©ti@© ©f S©y©d.©© h©© b©©© ©ygg©©t©d Ro for y©© i© © lUb^rn^Y ©hi©h i© ©j©vi©g ©<©©y f^mn th© ©oo©©pt ©f y©if@p©) h©©di©g© 0 2t h©© b©©n ©©©p©r©d t@ ©t©od©rd ©y©th@d© ©f ©yth@pity ©@©tp@L § °S©y©d b©©©d ©©ding focy§d§ ©o th© ©io©i(L©riti©© f©y©d ©©]©ng th© ©©©y f@p(©© ®f © nmmrn th©t © p©p©@© Pajight y©© ©h©© §©©p©hi©g © d©t©b©©© ©od bring© th©[© t©g©th©r f©r © ©©©p©h©r') © p©ry©©L ° C3B 5 p132J 0 Thi© ©©ynd© G^©th©p (Lib© ©n ©ti©npi t© ©void pr©p©r © P @ © © p©f ©p©n©ing 0 H©f©p© ©©©yming th©t © u^mt ° © ^©®pd © P n©©)© su © (Anii§tDk©0 it ©h©yLd b© L@©b©d yp in © li©t ©f h o n ° pp©f©pp©dffl f©pm©o If found c th© ©©©p©h ©hould ©yt©= p^©ti©©LLy b© dip©nt©d to th© p©n©rd© ind©©©d und(mr th©
i©d

p©f©p©n©©,

5 Fuzzy

matching

and spelling

correction

References 1 BOURNE C P and FORD D F. R study of methods for systematicalLy abbreviating English words and names.
Journal, of the Association for Computing Machinery 8,

1961, 538-552. 2 HRLL P R V and COWLING G R. Approximate string matching. Computing Surveys 12 C4D, December 1580, 381-402. 3 FREUND G E and WILLETT P. Online identification of word variants and arbitrary truncation searching using a string similarity measure. Information Technology :
Research and Development 1, 1982, 177-187.

4 ZRMORR E M, POLLOCK J J and ZRMORR H. The use of trigram analysis for spelling error detection. Information
Processing and Management 17 C63 , 1981, 305-316.

5 RDRM50N G W and BOREHRM J. The use of an association measure based on character structure to identify semantically related pairs of words and document titles.
Information Storage and Retrieval 10, 1974, 253-260.

6 WILLETT P. Document retrieval experiments using indexing vocabularies of varying size. II. Hashing, truncation, digram and trigram encoding of index terms. Journal of Documentation 35 C43 , Dec 1979, 296-305. 7 LENN0N M and others. On evaluation of some conflation algorithms for Information Retrieval. Journal of
Information Science 3, 1981, 177-183.

8 NORERULT T, KOLL M and McGILL M J. Automatic ranked output from Boolean searches in SIRE. Journal of the
American Society for Information Science 28, 1377, 333-

339. 9 MORRIS R and CHERRY L L. Computer detection of typographical errors. Bell Laboratories Computing Science
Technical Report, 18, 1974.

10 MORRIS R and CHERRY L L. Computer detection of typographical errors. IEEE Transactions Professional Communication PC-18 C1D, 54-63. 11 CORNEW R W. R statistical method of spelling correction.
Information Control 12, 1968, 79-93.

12 ULLMRNN J R . R binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. Computer Journal 20, (23, 1977, 141-7.

-53-

Fu^^y mmt^bzmg ©yd §p©(L(l£©g ©®^©®ti@© NUSSBRUM P ©yd SCOEK H Jo ^yte^t&e © F P @ F detection i© ©©typ©l l®op®§© ^©^(gJs CP©p©pt TO ?©o0BoO0S])o IBM 0©id©Lb©rg S®i©©titi© 0©yt©p0 DD?©o 14 POLLOCK J J ©cud 2RMDRP P c C®Li@cft£©o ©©d
ehinetir^ £g©t£©in) © f © p y l t l i y p ©IPP©©© i n ® © £ © y t i f i © ©rud © y b © t © r l y t d ^ t o J @ u ^ y © l ® f tb® ^<§(p£©<f)o S o c i e t y f ® ^ X o f © ^ / © © t i © o fed©©©© 3 4 C13 0 0 © o 1? i o s)D - § © .

POLLOCK J J © o d lOOOPP PL l y © t © i © d © © i g © f o r d © t © © t i © y © y d ©@(f© j ©®ti©y ® f §p©(ll©yg) d r r ( s ) r § £(© © © £ © y t i t i © ©tnd ©yh©l©p(ly teL 0 ® y y y ® l ©¥ t P © Pyxy^i©©® S @ ® i © t y f ® ^ I(Rif(g)FM}^t£(o)[ni §©£©©©© PS C23)0 USSM^ M=H©Dc I P P B P P Ho P y t y y y t i © dl@t©e4£©(n) ©trod ® @ r t P © y t i ® y © f © p © l l i © s @p[p@p© £ y © (L©pg}© d ® t © (b©@©o dOy/yy©(l © f t t e tRmm^^^m 0 ® © £ © t y f©(p O y f y ^ y y t E ® © ScOd?oc© 3D C13 0 1 0 0 0 fl E 1 ° S 7 o
1 ? P1SE0PP5 E M © y d PRISON ( ) (Ft0 ( ) © ® y t © y t y © L R R

p®i§tpr®©©i§iiing ©y©t©©3 f o r ©try©© ©s)PP@(gt£®(© y © i © g b i y y r y n^gp©©© 0 I E E E ?(p©og®et£®o§ ®© £ © e p y t ® p g C~23 COS 0 Pfey 1 S 7 4 , 4 8 0 ° 4 3 3 o ire

ia

E P 0 L 5 0 N 0o T © ® h y £ q y © ^ f ® p r © p l © © i © g ©P©p©©t©y© t t e t g©rb(L®d ®y i y p y 1 0 I © s 1/9ES Sptpimg J©£{nt C@mput e 0@©f ©p©®®© 0 (RlfZPS £®©f(^©©<g© Py©©©©d£©g^ 2 0 , 19S6o 0 ® © P £ y g t © y 0 0 § S p a r t a n , 1SSS, 1SS~1S2o ®©oypyt©p ©o @©p f ©p n i m d i
0

ilS OEM ECl £ T o B i v © y©ytr M©y 1 3 8 0 , 1 3 S - 2 0 O O

BYTE 2

10 CLPRKE Do i o u o d g f © y £ i l i © f y P © b p y © r y 11084© 9 0 - 3 3 0

Pr<D©t£©©)l

Co^pytiog?

7,

U\ 0 0 0 0 0 5 J Ro F i n d i n g w^rdm thmf a l g o r i t h m . POPE 7d P © r © h 1 3 8 2 , 2

s©yod ©Lik© 473-474.

s the

Souodi

SUNFER 0 Bo P S t © t i ^ t £ y © t Rppp^m^h t © P©y®pd L i o f o a g © 0 t t © y © s 0 © y i © i ® y Syp©©y ©f S t a t i s t i c s 0 13S70 S0NTER R B © y d P E 0 L E B 1 1 P C fin © p f i n © l F t e o i ^ y ®F P©y®pd L i oft? ©pi© O t t © y © g D © ^ i © i © o llyp©©© © f S t g t i m t i © © , 13B7o FEN1CPE0 P P © y d EPPPETT 0 0 o P© © p p L £ © © t i ® y - i © d © p © n d © n t ©yb©y©t©©i] f © p f r © © - t © ^ t ©®©0y©i©o C^mputm^m ©®d iO@~ ^ © d £ © © l P©§©©^©P S , 1 S 7 B , 1 S S ~ 1 S 7 0 EPBP 0 No S t r i y g © i ^ i ( L © r i t y © y d u©i©©p©t L i n g © 0 G ® « y ® £ y © t i @ © i i ® f t h © POP D® C S 3 , Pl©y 1 S B 7 , 3 0 2 = 3 1 3 . P4 O 0 5 0 P 0 o P © t p £ © v © L © f © i © © p © L L © d ©©/©©© i © ©n © i r L £ o © i p©©©©ng©r p y © © r d ©y©t©m 0 0©(©©7y©i®©ti©©s ® f thm POP S C33, P©r©h ,, 1 B 3 - 1 7 1 0

17 BLiPlP 0 R 0 0 p r ® g r © m f®o lof®^©()©t£@© © y d C o o t r y i

^rr©©tin© np©LLi©g 1DBGL BQ-B7o

t p r o n

_©© =

5 Fuzzy

matching

and

spelling

correction

28 DOMERRU F J. R technique for computer detection and correction of spelling errors. Communications of the ROM 7 C33, March 1964, 171-176. 29 GREENFIELD R H. On experiment to measure the performance of phonetic key compression retrieval schemes. Methods
Of Information in Medicine 16 C43, 1977, 2 3 0 - 2 3 3 .

30 MOORE F J. Mechanizing a large register of first order
patient d a t a . Methods of Information in Medicine 4,

1965, 1-10. 31 J0NE5 R M. Improving Okapi : transaction log analysis of failed searches in an online catalogue. VINE 62, May 1986, 3-13. 32 KELLY B and others.
System. Information

Bibliographic Recess & Control
Technology and Libraries 1 C23 , J u n e

1982, 125-132. 33 COCHRPNE P R. "Friendly" catalog forgives user errors : no librarian intervention necessary on dream online system called "Paperchase". Qmerican Libraries 13 C5D , May 1982, 303-306. 34 J05EPH D M and WONG R L. Correction of misspellings and typographical errors in a Free-Text Medical English Information Storage and Retrieval System. Methods of
Information in Medicine 16 C43 , 1979, 2 2 8 - 2 3 4 .

35 WONG R L and others. Profile of a dictionary cornpxled from scanning over one million words of surgical pathology narrative text. Computers and Biomedical Research 13, 1980, 382-398. 36 R0UGHT0N K G and TYCK0S0N D R. Browsing with sound : sound-based codes and automated authority control.
Informatxon Technology and Libraries 4 C23, J u n e 1985,

130-136.

-55-