6 Design and implementation

6.1 Introduction

This chapter describes the retrieval devices to be evaluated in this project. These are:

   stemming and spelling standardisation
   the use of a lookup table of cross references and "go" phrases
   a method of suggesting corrections for words which may be misspelt

The devices were used in two catalogues referred to as EXP and CTL. EXP incorporates all the devices, including two-level stemming. CTL has "weak" stemming, but none of the other devices. The catalogues appear identical in casual use. The catalogue systems are described here from the point of view of structure and implementation rather than appearance. The appearance and presentation of the catalogues is described in Chapter 7. Some may prefer to read this chapter and Chapter 7 in parallel. This chapter also describes the method of term combination which we used. This is combinatorial rather than boolean. Most post-coordinate retrieval systems either use explicit boolean AND and OR or, more frequently in online catalogues, an implicit boolean AND. Okapi systems, like CITE (2.5) and the "keyword" subject enquiry in the SWALCAP LIBERTAS system, retrieve all records which contain enough of the words in the user's search, outputting the "most similar" records first.

6.2 Stemming and spelling standardisation

6.2.1 Background

Automatic stemming has been investigated many times for reference retrieval systems but is only used in a few online catalogues (see Chapter 3 for a survey). In reference retrieval systems most researchers have concluded that it is beneficial. It appears to increase recall without unduly decreasing precision, and its performance is said to be comparable with that of manual truncation by intermediaries. When stemming has been used in online catalogues these have usually been catalogues which access specialised collections. The prime example is the National Library of Medicine's CITE [1].
We do not know of any online catalogue which uses stemming in accessing a general collection for general users. It seems likely that one of the reasons for the apparent success of automatic stemming in reference retrieval systems is that the searcher is generally trying to retrieve one or more sets of records for printing offline. These output sets are usually produced by boolean combination of a number of intermediate sets. It is perhaps not too much of a generalisation to say that the searcher (intermediary) seeks to construct a canonical formulation of a search. It may not matter if some of the intermediate sets or single term searches contain a substantial proportion of false drops attributable to stemming. The use of general online catalogues is very different. There is no need, and rarely any attempt, to find a "definitive" search statement. Most search statements consist of only one or two words (see, for instance, Table 8.3, which shows that the most frequent number of terms is two, with a mean of about 2.2; Okapi users are quite typical in this respect). A session at the catalogue often consists of a sequence of related searches, any of which may contribute to the user's satisfaction or lack of it. When automatic stemming is used on searches consisting of only one or two words even the most conservative procedure often gives false drops which would have been ANDed out in a longer search. Even conflating singular and plural forms can lead to a good deal of noise: examples drawn almost at random from Okapi '84 transaction logs are right, rights; age, ages; Lord, [house of] Lords; account, accounts; communication, communications; mass, masses; art, arts. The removal of "ing" and "ed" endings can also be detrimental: age, aging, aged; market, marketing; account, accounting; train, training.
An examination of any word list drawn from a collection of noun phrases shows that other suffixes such as "er", "tion", "ence", "ism", "ist", "ity" often affect the meaning very markedly, and that their removal can be expected to lead to a significant proportion of false drops. Most stemming evaluation has been done on searches of databases in the "hard" sciences. We looked at 6700 consecutive subject searches (after removal of immediate repetitions of a search) from the logs of an Okapi '84 terminal to try to get an idea how often stemming would be detrimental and how often beneficial. The types of word looked at were regular English plurals, "ed" endings, "ing" and "ings" endings, and "tion" and "sion" endings and their plurals. The results are summarised in Table 6.1. After removal of very common function words and stop words, these search statements contain 3585 distinct words (types) and 14,584 occurrences of the words (tokens), a mean of 2.2 words per search. (For all tokens, the mean number per search statement is about 2.5.)

Table 6.1 Types of word used in subject searches (Okapi '84)

                                      Types           Tokens
regular plurals                        715 (19.9%)    2201 (15.1%)
"ed" endings                            56  (1.5%)     125  (0.9%)
"ing(s)" endings                       177  (4.9%)     638  (4.4%)
"tion(s)" and "sion(s)" endings        255  (7.1%)    1107  (7.6%)
nonce-words (excluding numbers)       1833 (51.1%)    1833 (12.6%)
misspellings and rubbish               565 (15.8%)     650  (4.5%)

Total                                 3585           14584       (6692 searches)

The table shows that plurals are far less frequent than singulars, that "ed" endings are probably rare enough to be disregarded, but that some other suffixes occur quite often. Any stemming procedure has to be applied both to the index language and the search language. The above breakdown could be usefully compared with a corresponding table compiled from the language of bibliographic records, with title words and controlled subject headings treated separately. Unfortunately we have not had time to do this.
We would expect to find that the distribution of suffixes in search statements is quite similar to that in titles, but markedly different from their distribution in subject headings. Because subject headings are for description and recognition rather than for searching, they have their own "subgrammar". One feature of this is an extensive use of plurals where searchers and authors would use singulars. (For an enlightening and depressing discussion of Library of Congress subject heading practice see Mischo [2]. Schabas [3] found that the performance of PRECIS was similar to LCSH when used for keyword searching.)

6.2.2 Functional design considerations

Mitev and Walker [4] asserted that the first records shown should be those (if any) containing the exact phrase entered by the user. The search system should then, if necessary, look for records which partially match or bear some resemblance to the search statement. They hold that a search should be automatically broadened when appropriate (user is not satisfied, no records found, only a few records found) by any or all of the following means:

1  stemming
2  relaxing the word order constraint
3  using word fragments
4  looking for records which contain only some of the words of the search.

In this research we were not concerned with word order, except in so far as one of the catalogue systems (EXP) uses a small dictionary of phrases. The Okapi indexes do not contain adjacency information, and Okapi suffers from the usual "false coordination" effects. (Try LIBRARY SCIENCE or WAR AND PEACE on any catalogue which does an implicit AND on title and subject words.) Nor do we use word fragments. See Chapter 4 for a discussion of the use of n-grams for finding classes of morphologically similar words. The fourth heading - being prepared to display records which may not contain all the words of the search statement - is a feature of CITE, Okapi and of the SWALCAP LIBERTAS system.
Again, it is one which has been investigated in reference retrieval systems with generally favourable results. It is not the subject of this project, although there is a strong connection between the way we used stemming and the way in which terms are combined or "merged". This merge procedure is described in 6.5.

6.2.3 Strong and weak stemming

Ideally the system should first look for records which match the actual words of a search, without any stemming. This would be followed by "weak" stemming (conflation of plural, singular and "ing"), and this in turn by "strong" stemming. The degree of stemming, if any, to be applied initially might be made to depend on the number of words in the search and possibly on their frequency in the index. For example the search SUCCESS AT INTERVIEWS finds no matches in the PCL file when records containing "success" and "interviews" are sought, and still fails when plurals and singulars are conflated. But when stronger stemming removes the suffix from "successful" it finds a book with the title "How to succeed at an interview : .. a guide to successful preparation for job applications .. ". (JOB APPLICATIONS is a better way of expressing this search. It finds 11 books, all but one of which are likely to be relevant. Having found the above entry to the file a user might decide to try this reformulation of the search.) Another example is TERMINAL ILLNESS which finds no records without stemming, but the file contains four relevant records indexed under "terminally" and "ill". Any catalogue which does not find these records in response to these search statements is, to put it mildly, inadequate. Few users would try rephrasing the last example as THE TERMINALLY ILL.

6.2.4 Spelling standardisation

Many common words have alternative acceptable spellings. Most of these represent differences between British and American English.
Even within British English there are judgment/judgement, connexion/connection and many others. A large proportion of the differences can be given in the form of rules rather than tables, although most of the rules ought to have exceptions. Replacing every iz by is makes a few words such as "size" look rather odd. This is one of the reasons why we decided not to show the user how the system represented search words. We had to design the interaction so that links are maintained between what the user types and what the system is processing. Rare variations ("ium" = "um" at the end of a word, as in "aluminium") are handled by lookup tables (6.3). Since the spelling standardisation was to be rule-based we decided to incorporate it in the stemming procedure. The rules were concocted after a study of past searches on Okapi, and of lists such as that given by Paice [5].

6.2.5 Two levels of stemming

There are obvious examples (right, rights etc) which show that it is not always safe to apply any stemming at all, but for reasons of simplicity, economy and ease of evaluation we chose to apply a "weak" transformation, which included the spelling standardisation, to all searches. We hoped that few searches would be seriously affected by the unavoidable false drops occasioned by the conflation of plurals, singulars and "ing" forms. We had to decide how to sequence the application of the two levels - should the second level only be applied if requested, or if the first did not lead to any retrievals (records containing all the stems in the search)?
The method used in Okapi '86 is extremely simple, but could not be applied to a system which (like most current online catalogues) does an implied boolean AND on the words of the search. Weak and strong stems are generated and looked up in the appropriate indexes, each strong stem being linked with its corresponding weak stem. Each stem is assigned a weight based on the inverse of its frequency, so common words are worth less ("lighter") than rare ones. In particular, a strong stem generally has a lower weight than its corresponding weak stem. The postings lists for all the stems are then merged, each posting in the output list being given a weight which is the sum of the weights of the contributing stems, except that the weight of a strong stem is not added if the corresponding weak stem has also matched. This avoids giving falsely high weights to records which contain, for example, both "success" and "successful". (See 6.5 for a fuller description of the merge procedure. The procedure described above is only used in the experimental (EXP) system - the control (CTL) system only uses weak stems.) Additionally, postings are not accepted for the output list unless their weight reaches a certain threshold based on the weight of a hypothetical record containing all the weak stems in the search. This ensures that records bearing little resemblance to the search are not retrieved.

6.2.6 Interaction design

How should the stemming be presented to the user? Should the system explain what it is doing? If so, how? Many users do not know that a keyword catalogue looks for "words" rather than "ideas" or "concepts". (Some of those who do realise roughly what is going on regard it (rightly) as "unintelligent".) The catalogue should give some introductory explanation of how
it works, but this must be very brief and pithy. It is undesirable to present (except on request) more than an absolute minimum of non-essential text since it is impossible to make people read the screen. Two lines of explanation is about the maximum. Examples:

   This catalogue looks for items containing the words you type in and other words similar to them

   The computer will look for books whose TITLE and SUBJECT HEADINGS contain as many as possible of the words you type

Both examples are intended to suggest that the computer looks for words rather than phrases. The first is about as far as one can go in casually introducing the idea of stemming. The second can be used for "best match" systems such as Okapi which do not do an implicit AND. We have not been able to think of any message which is short enough and introduces both features. One function of explanation is to try to avoid users being put off when a search fails or retrieves unexpected items ("That isn't what I asked for!"). When a search fails (to find anything) the system should be able to give some sort of explanation at that point, but it cannot - by definition - know when it has displayed false drops. The explanation is not so much to excuse the system as to reassure the unsuccessful user. It is important that a catalogue should not often make a fool of itself, but it is more important that the catalogue should not make a fool of the user. Another function of explanation is to help users improve their search techniques. The fact that the system is searching for words rather than concepts is more likely to be appreciated if it displays each word as it is searching for it.
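The stem-weighting and merge scheme described in 6.2.5 (and more fully in 6.5) can be sketched in code. This is a minimal illustration, not the Okapi implementation: the logarithmic inverse-frequency weight, the threshold fraction and all the names below are assumptions.

```python
import math

def term_weight(n_postings, collection_size):
    # An inverse-frequency weight; the exact Okapi formula is not given
    # in the text, so this simple log form is an assumption.
    return math.log(collection_size / n_postings)

def merge(stems, postings, weak_of, collection_size, threshold_fraction=0.5):
    """stems: weak and strong stems for the search.
    postings: dict stem -> set of record ids.
    weak_of: dict mapping each strong stem to its weak stem."""
    weights = {s: term_weight(len(postings[s]), collection_size) for s in stems}
    scores = {}
    for s in stems:
        weak = weak_of.get(s)
        for rec in postings[s]:
            # A strong stem's weight is not added if the record already
            # matched the corresponding weak stem: this avoids inflating
            # records containing e.g. both "success" and "successful".
            if weak is not None and rec in postings.get(weak, set()):
                continue
            scores[rec] = scores.get(rec, 0.0) + weights[s]
    # Threshold based on a hypothetical record containing all weak stems.
    weak_stems = [s for s in stems if s not in weak_of]
    cutoff = threshold_fraction * sum(weights[s] for s in weak_stems)
    ranked = [(rec, w) for rec, w in scores.items() if w >= cutoff]
    return sorted(ranked, key=lambda pair: -pair[1])

postings = {"success": {1}, "succes": {1, 2}, "interview": {1, 2}}
result = merge(["success", "succes", "interview"], postings,
               {"succes": "success"}, collection_size=10)
```

Record 1 matches the weak stem "success" as well as "interview", so it outranks record 2, which matches only via the strong stem.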
Further, if the search should fail, some users may have noticed which words are likely to be useful and which are not: this may help them if they have to rephrase their search. Presentation during searching must depend partly on the output from the stemming and spelling standardisation. If we had used a procedure which performed "normalisation" on the stems it produced, so that they are always "real" words, it would be possible to display the stemmed words to the user. For example, if "computing" weak stems to "compute" and "computer" and "computation" strong stem to "compute" the system could display

   Looking up "compute..."

or even

   Looking up 'compute'

followed by

   Found 2317 books under "compute" and similar words

but one cannot display "comput", "censu" or "filosofi", which are forms produced by our stemming and spelling standardisation. (We also tried using plus signs to indicate truncation. The only concise and transparent ways of suggesting that a word or phrase has been truncated, or that some of it is "missing", seem to be two or more dots or the word "etc".)

6.2.7 Choice of stemming procedure

We used a version of Martin Porter's suffix removal algorithm [6] because it is short (easy to code and does not need much memory), tends to "under-stem" and has been shown to perform well in reference retrieval searching [7, 8]. This algorithm removes apparent "s" plurals followed by "ed" and "ing" from words of any length, and then successively removes other suffixes if the remaining stem would be long enough (the measure of length is a quasi-syllabic measure based on the number of vowel-consonant transitions). Most final double consonants are reduced to single, so that "travel" and "travelling" etc are properly conflated. We adapted Porter's procedure, making it a two stage process and adding spelling standardisation in Stage 1.
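The two-stage idea can be made concrete with a much-simplified sketch. This is neither Porter's full algorithm nor the Okapi code: only a handful of the suffix removals and standardisations (those listed in 6.2.8) are shown, and the strong-stemming suffix list is an illustrative assumption.

```python
import re

def weak_stem(word):
    """Stage 1 sketch: Step-1-style suffix removal plus a few of the
    spelling standardisations of 6.2.8. Much simplified."""
    w = word.lower()
    if len(w) < 4 or not w.isalpha():
        return w                      # short words and "words" with digits left alone
    if w == "united":
        return w                      # the one-entry exception table
    if w.endswith("ies"):
        w = w[:-3] + "y"
    elif w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    elif w.endswith("ing") and len(w) > 5:
        w = w[:-3]
    elif w.endswith("ed") and len(w) > 4:
        w = w[:-2]
    w = re.sub(r"([b-df-hj-np-tv-z])\1$", r"\1", w)  # reduce final double consonant
    w = w.replace("iz", "is")                        # a few standardisations
    w = re.sub(r"ae(?!$)", "e", w)
    w = w.replace("ph", "f").replace("oe", "e")
    if len(w) > 5:
        w = w.replace("our", "or")
    return w

def strong_stem(word):
    """Stage 2 sketch: strip a few further suffixes from the weak stem.
    (The real Stage 2 is Steps 2-5 of the Porter algorithm; this suffix
    list is illustrative only.)"""
    w = weak_stem(word)
    for suffix in ("ation", "ition", "ism", "ist", "ness", "ful", "er"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 4:
            return w[: len(w) - len(suffix)]
    return w
```

With these rules "organize" weak stems to "organise", "travelling" to "travel", and "successful" strong stems to "success", so a weak stem stays closer to the original word than its strong stem.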
The Stage 1 process is intended to be non-contentious - that is, its application should rarely affect the meaning of a search. It is essentially Step 1 of the Porter algorithm with spelling standardisations. We refer to this as "weak stemming". The second stage involves carrying out Steps 2-5 of the original Porter algorithm; this we call "strong stemming". Thus any word has both a weak and a strong stem: the weak stem is considered "closer" to the original word.

6.2.8 Stage one - weak stemming and spelling standardisation

This stage removes regular English plurals and "ed" and "ing" endings, then reduces most double consonant endings to single. No endings are removed from words under four letters long or from "words" which contain digits or other non-alphabetic characters (hyphens have already been squashed and/or replaced by blanks). There is an exception table containing one entry: the word "united" (United States, United Kingdom etc). Conflating "united" with "unit" leads to false drops. For other words, Step 1 of the original Porter algorithm is done, followed by these spelling standardisations - an example follows each. In many cases they cope with differences between American and British spellings of words; in some cases they trap common spelling errors.
1   iz --> is (organize --> organise)
2   ae --> e except at the end of a word (orthopaedic --> orthopedic)
3   ph --> f (sulphur --> sulfur)
4   oe --> e (foetus --> fetus)
5   our --> or if word length greater than five (behaviour --> behavior)
6   exion --> ection at the end of a word (connexion --> connection)
7   nse --> nce at the end of a word (defense --> defence)
8   amme --> am at the end of a word (programme --> program)
9   gue --> g at end (catalogue --> catalog)
10  ism --> ist at end (feminism --> feminist)
11  ant --> ent at end (dependant --> dependent)
12  tre --> ter at end (centre --> center)
13  anc* --> enc* at end if length > 6 (* denotes one letter or no letters) (dependance --> dependence)

For many of the changes it doesn't really matter which way round they are done: for example we could equally well do "ent" --> "ant" at the end of a word. It is more a case of making words equivalent. Strong stemming, on the other hand, often leads to false drops. A good example is "communism" and "communications", which both strong stem to "commun". Another example is "organisation" and "organism". When searching for words we search for both weak and strong stems, but the weak stems are given a higher weight than the strong ones. We have tried to take account of common spelling variations, mainly differences between British and American spellings. We also try to cope with words with more than one acceptable spelling, e.g. archaeology, archeology; foetus, fetus;
mediaeval, medieval. The changing of endings "ant" and "anc*" to "ent" and "enc*" respectively is because users often make mistakes with such endings, and where either ending can be correct the words are similar enough in meaning to justify treating them the same. (Examples are "independent", "independant"; "correspondent", "correspondant".) Words often look rather strange when weak or strong stemmed. Okapi uses stemming in its index and then again on the user's search statement, but these stems are never actually displayed to the user, so their appearance does not matter. For example, we get "fotografi" from "photography", but the former only serves as Okapi's internal representation of the term; as far as the user is concerned the computer is searching for "photography". Spelling standardisation often acts on words in unintended ways. Again, in most situations this doesn't matter. The word "upheaval" stems to "ufeaval", but as long as no other word stems to "ufeaval" it will cause no problems. Some other unintended alterations are:

   dizzy    -->  diszi
   advance  -->  advence

In practice, none of these (nor many others) matter, but there are a few which should be treated as exceptions, like "united" (see Some oddities below). We decided that record retrieval would be improved by treating "ism" and "ist" at the end of a word as equivalent.
Examples include socialism, socialist; Marxism, Marxist; structuralism, structuralist; racism, racist; fascism, fascist.

6.2.11 Some oddities

Inevitably some odd things happen with stemming. Some of these happened with the original algorithm but others are a direct consequence of our adaptations. These are a few of the transformations which may decrease precision or even recall:

   herring   -->  her
   organism  -->  organist
   poetry    -->  petri
   poet(s)   -->  pet
   shoes     -->  she
   schism    -->  schist

And place names do not always lend themselves readily to stemming:

   Woking     -->  woke
   Dungeness  -->  dung

6.3 Phrases and the go/see list

The Okapi go/see list contains classes of terms which we treat as equivalent, and phrases which we treat as single terms. It was compiled after study of a large number of transaction logs of the use of Okapi '84 in Riding House Street Library, i.e. the emphasis is on the words and phrases that are most likely to be searched for in Okapi. An "equivalence class" of terms is a list of words (or phrases) which we treat as having the same meaning at search time. For example, "France, French" may comprise one class, and "VDU, Visual Display Unit, VDT, Visual Display Terminal" another. A special case is when the list has only one item, which will be a phrase we want to treat as a complete phrase rather than as individual words. Examples of this are "industrial revolution" or "soap opera". In the index, any member of a given equivalence class will point to the same set of postings, which will be the set of all postings which contain at least one of the class terms. Thus a search for "France" may retrieve records which contain the word "French", and vice versa.

6.3.1 Categories of equivalence classes

The classes in the go/see list may be roughly categorised into the following eight groups, although some classes belong to more than one group.
1  Abbreviations
   "BBC, British Broadcasting Corporation"
   "CND, Campaign for Nuclear Disarmament"
   "TV, Television"
   "VAT, Value added tax"

2  Noun/adjective pairs - mainly for proper names
   "Switzerland, Swiss"
   "Wales, Welsh"
   "Florence, Florentine"
   "Freud, Freudian" etc.

3  Alternative terminology
   "Russia, USSR, Soviet Union, Russian ... (etc)"
   "movies, moving pictures"
   "Holland, Netherlands, Dutch"
   "Conservative Party, Tory Party"
   Included here are terms which may be one word or two:
   "micro computers, microcomputers"
   "ultra violet, ultraviolet"
   "infra red, infrared"
   "Serbo Croat, Serbocroat"

4  Alternative spellings
   "gaol, jail"
   "csar, czar, tsar, tzar"
   "aluminium, aluminum"
   "aeroplane, airplane"

5  Common irregular plurals
   "man, men"
   "woman, women"
   "phenomenon, phenomena"
   "wife, wives"

6  Related terms
   "child, children, childhood, childish"
   "tax, taxation, taxability, taxable"
   "Third World, underdeveloped countries, developing countries, .. "

7  Numbers
   "6, six, sixth, vi" (and similarly for other numbers up to 20)

8  Phrases
   "Pirate Radio"
   "Labour Party"
   "Official Secrets Act"
   "Notting Hill"
   "General Strike"

6.3.2 Phrases

The use of phrases poses some problems, not least of which is the question of what to include. The main use of phrases is in keeping together words which would lead to false coordination if separated. However, care must be taken in selecting phrases for the list because the inclusion of a phrase in the go/see list can sometimes lead to a reduction in the number of relevant books found. The classic example of a "good" phrase to have on the list is "soap opera", because books on "soap" alone or "opera" alone would not be relevant in such a search.
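The shared-postings arrangement described at the start of 6.3 - every member of an equivalence class pointing to the same set, the union of the class's postings - can be sketched like this (hypothetical data structures, not the Okapi file format):

```python
def build_go_see_index(record_terms, classes):
    """record_terms: dict record id -> set of index terms.
    classes: equivalence classes, each a list of terms.
    Every member of a class ends up pointing at one shared set holding
    all records that contain at least one of the class terms."""
    class_of = {}
    for cls in classes:
        key = tuple(t.lower() for t in cls)
        for term in key:
            class_of[term] = key
    postings = {}
    shared = {}                      # class key -> its single shared set
    for rec, terms in record_terms.items():
        for term in terms:
            t = term.lower()
            cls = class_of.get(t)
            if cls is None:
                postings.setdefault(t, set()).add(rec)
            else:
                s = shared.setdefault(cls, set())
                s.add(rec)
                for member in cls:   # all members share the same set object
                    postings[member] = s
    return postings

idx = build_go_see_index(
    {1: {"France", "cooking"}, 2: {"French", "grammar"}},
    [["France", "French"]],
)
```

A lookup of "france" in `idx` now also yields the record containing only "French", and vice versa, as in the "France, French" example above.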
In practice, a search for just "soap opera" would retrieve relevant books even without the phrase list, but as soon as any other terms are included in the search there is a danger of the words becoming separated. For example, a search for "the production of soap operas" could equally well retrieve books on "soap production" or "production of operas", neither of which, we assume, would help the user much. Similarly, in most other cases there is little to be gained from using the phrase list if the phrase is the only thing in the search. Only when other terms are added does the presence of phrases in the index help to prevent false coordination. A phrase whose constituent words have meanings very different from that of the whole phrase is an ideal candidate for inclusion in our list. However, most cases are not as clear-cut as "soap opera". In many instances some or all of the component words are related in some way to the phrase. For example, consider the phrase "pirate radio": books on "radio" alone might be relevant, whereas books on "pirates" almost certainly would not. Phrases which contain very general words should be included because they provide ample opportunities for false coordination; "Welfare State", "middle classes", "General Strike" and "social science" would come into this category. A search for "middle class education" could retrieve books on "education in middle schools" or "education in mixed ability classes". Similarly a user
requesting "women and the Welfare State" might be offered a book on welfare in New York State. If we can trap known phrases such as the above, we will reduce the probability of these false drops. Political parties ("Labour Party" etc) are included because "Party" can easily attach itself to the wrong word: a search for DECLINE OF THE LABOUR PARTY retrieved "The decline of the Liberal Party 1910-1931" before the phrases were included. Some phrases should not be included in the go/see list: they are phrases which can be (and are likely to be) expressed in alternative ways in the records. For example, "French Revolution" might be considered a suitable candidate for inclusion in the list. Certainly this would lead to all books on the French Revolution being retrieved if this topic was searched for; however, what would not be found are all the equally relevant books with titles or subject headings such as "Revolution in France", "The French and Russian Revolutions" or "France during the revolution" etc. With all phrases there is some danger of the above situation arising. For example, if we include the phrase "World War II" then we would miss books on "World Wars I and II". However, in such a case the advantages of including the phrase probably outweigh the disadvantages. With any phrase, the decision whether or not to include it must be a fairly subjective one, and we must rely on analysis of users' subject searches to measure the usefulness or otherwise of our various choices.
6.3.3 Problems in searching for phrases

At search time, we always scan the index for the longest match possible with the user's search string, thus picking up phrases where they occur. A "term" in the search may therefore be either a single word or a phrase. When the postings are merged, inverse frequency weights are assigned to the terms. At the moment we have no special way of assigning weights to go/see list terms, and their weights may therefore be a little artificial. There is a danger, for example, of terms being given artificially low weights because they belong to an equivalence class and are thus highly posted. Because the list of postings for "Spain" also includes all postings for "Spanish", "Hispanic" and "Espana", the weight assigned to "Spain" in a search will be lower than it would be if "Spain" were not on the go/see list. In most cases this will probably not have disastrous consequences but it is something which should be borne in mind, perhaps when compiling the go/see list. For the above reason, there seems to be a case for simply adding 1 to the weight of any go/see list term. In the case of phrases, a suitable way to assign a weight might be to calculate the weights of the component words, sum them, and then add 1. It might also be a good idea to add the individual words to the merge as well, because the weighting system just described would ensure that postings containing the actual phrase would have a higher weight than those only containing the individual words, with the added advantage that phrases like "French Revolution" and "World War II" would work better (see discussion above). A disadvantage of applying techniques such as the above is that it makes the (already complex) search string analysis even more complicated, and involves more index searching which takes a relatively long time.
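The suggested weighting adjustments can be sketched as follows. This is purely illustrative: `word_weight` stands in for whatever inverse-frequency weight the system computes, and the toy values are invented.

```python
def phrase_weight(component_words, word_weight):
    # Suggested scheme from above: sum the component word weights and
    # add 1, so a posting containing the actual phrase always outranks
    # one containing only the separate words.
    return sum(word_weight(w) for w in component_words) + 1

def go_see_term_weight(term, word_weight):
    # Suggested compensation for a highly posted equivalence class:
    # simply add 1 to the term's own weight.
    return word_weight(term) + 1

# Toy weights for illustration only (stand-ins for inverse frequency weights).
toy = {"french": 2.0, "revolution": 1.5, "spain": 1.2}.get

w_phrase = phrase_weight(["french", "revolution"], toy)   # 2.0 + 1.5 + 1
w_class = go_see_term_weight("spain", toy)                # 1.2 + 1
```

The extra 1 is what guarantees the ordering: a record matching the whole phrase scores strictly more than one matching all the words separately.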
6.3.4 Other go/see list categories

We don't include abbreviations which are in themselves words, such as WHO or AIDS, since this could lead to some very strange retrievals. Other equivalences are particularly useful for rare terms because relevant books will often be found which would not otherwise have been. For example, there are very few books on Tibet, and all contain only one of the terms "Tibet" and "Tibetan". Thus by making these two terms equivalent we can retrieve all the books on Tibet rather than just those containing the word "Tibet". A search for "Tibetan religions" will find "Politics and Religion in Tibet".

6.3.5 A note on stemming and indexing

The index contains both "weak" and "strong" stems of words (6.2.3). Originally, we intended that if a phrase was on the go/see list, only the strong stems of its component words would go in the index. However, there turned out to be serious problems with this approach and so now all words go in the index as both weak and strong stems regardless of whether they also form part of a phrase. The reason for the problems with the former method is that when a phrase's component words are relevant in themselves, they get "lost" if searched for on their own. For example, the phrase "Communist Party" is on the go/see list. Originally, this meant that although any books containing "communist party" would be posted under this phrase, they would only be posted under the strong stem of "communist". Thus they would only receive a low weight in a search such as "Communist politics" where they would be very relevant.

6.4 Spelling correction

To undertake spelling correction interactively (word-by-word) on limited hardware it is essential to use a two-stage process (Chapter 5). The first requisite is a fairly large dictionary - large enough to contain the majority of words which users may misspell.
Some systems store several categories of words: there may be a special dictionary of very common misspellings linked to their correct forms (e.g. ADN --> AND, NECCESARY (etc.) --> NECESSARY). This is the only practical way of correcting short words. Sometimes the dictionary is coupled to a set of rules for "expanding" entries by adding plausible affixes (EXAGGERATEDNESS might not be in the dictionary, but could nevertheless be accepted as an allowable word). The dictionary may also be partitioned into specialised sections to be used in different contexts. The second requisite is a fast lookup procedure for selecting dictionary entries which are near enough to the user's word to be considered as possible corrections. The lookup should return a list which balances the requirements of recall and precision. It should include all or almost all candidate replacements, but must be of manageable size, because each word in the list is going to have to be matched against the user's word to produce either a single word for automatic correction or a short list for the user to select from. Finally, there is a matching algorithm which measures in some realistic way the likelihood of a given candidate word being a proper correction for the user's word.

6.4.1 Dictionary

The obvious source for the dictionary is the bibliographic file itself. We first tried extracting all words from the subject-rich fields of the bibliographic source file. Since it is dangerous to try to correct short apparent misspellings, we selected only words of five letters or more, and excluded all numbers and "words" containing digits. There was a considerable proportion of non-English words. Experiments showed that such a system would sometimes suggest correction to a foreign word.
Since our users only very rarely enter non-English words (other than proper names) we tried to eliminate as many as possible by not extracting title words from records whose MARC 008 field contained anything other than "eng". There were 12,350 "non-eng" records (13% of the file). Excluding these reduced the size of the dictionary from 47,040 to 30,610 words.

6.4.2 Selection of candidate replacements

Examination of Okapi '84 searches had shown that around two-thirds of misspellings retained the approximate consonant structure, and it is well documented that initial letters are rarely wrong [9, 10]. We decided to use a soundex-type procedure in which the first letter of each word is left unchanged, vowels and a few consonants ('h' and 'w') are ignored, doubled consonants are made single, and the remaining consonants are divided into approximate phonetic groups. If adjacent consonants are in the same group only a single code is output. Some information is retained from the vowel structure: if two consonants are in the same phonetic group but are separated by one or more vowels (or "h" etc.) the code for that consonant group is output twice. Also, a vowel other than "e" at the end of a word is represented by a "vowel code". The algorithm is given in Appendix 1. We also tried a similar procedure which retained more of the vowel structure by representing any vowel or sequence of vowels by a "vowel code". This gave shorter candidate replacement lists, but sometimes missed easy corrections (e.g. *HURIZONS --> HORIZONS).

Examples:

  economics )
  economic  ) --> ecmmc   (repeated "m" because of the intervening vowel)
  *ecomonic )

  *econmic    --> ecmc

  rabbit )
  rabid  )
  rapid  ) --> rbd
  repeat )
  (and many others)

  sociology  )
  *socialogy ) --> sclcy
  *sociolgy  )

The access keys for the dictionary are Soundex codes. Each Soundex code has a pointer to the list of words which give rise to it.
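The encoding could be sketched as follows. The exact algorithm is in Appendix 1, which is not reproduced here; the phonetic groups, their representative letters and the "vowel code" letter below are assumptions reconstructed from the worked examples.

```python
# Approximate phonetic groups, each mapped to an assumed representative
# letter (reconstructed from the examples; the real table is in Appendix 1).
GROUPS = {
    'b': 'b', 'p': 'b', 'f': 'b', 'v': 'b',              # labials
    'd': 'd', 't': 'd',                                  # dentals
    'c': 'c', 'g': 'c', 'j': 'c', 'k': 'c',
    'q': 'c', 's': 'c', 'x': 'c', 'z': 'c',              # gutturals/sibilants
    'm': 'm', 'n': 'm',                                  # nasals
    'l': 'l', 'r': 'r',
}

def encode(word):
    word = word.lower()
    code = word[0]                      # first letter passes through unchanged
    prev, vowel_between = GROUPS.get(word[0]), False
    for ch in word[1:]:
        if ch not in GROUPS:            # vowels, 'h' and 'w' are ignored ...
            vowel_between = True        # ... but remembered
            continue
        g = GROUPS[ch]
        # Same group, no intervening vowel: single code (covers doubled
        # consonants). Same group with an intervening vowel: code again.
        if g != prev or vowel_between:
            code += g
        prev, vowel_between = g, False
    if word[-1] in 'aiouy':             # vowel other than 'e' at the end
        code += 'y'                     # assumed "vowel code"
    return code
```

Under these assumptions the sketch reproduces the examples above: "economics", "economic" and "ecomonic" all encode to "ecmmc"; "rabbit", "rabid", "rapid" and "repeat" all give "rbd"; "sociology" gives "sclcy".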
That is, it is constructed as an inversion of the examples above - the arrows point the other way:

  ecmmc --> (economic, economics, .. )
  rbd   --> (rabbit, rabid, rapid, repeat, .. )
  sclcy --> (sociology, .. )

6.4.3 Finding the nearest match

When a word in a user's search statement does not stem to anything in the index, the word is encoded using the same algorithm (given above) which was used to construct the dictionary. The code is looked up in the dictionary. If it is not found, the procedure terminates. If it is found, the code's associated word list is scanned sequentially. Each word in the list is compared with the user's original word, and a "matching score" calculated. If there is any word with a high enough score, the one with the best score is offered to the user. If there is more than one with the same score, the first is offered (effectively arbitrary).

  CAN'T FIND 'sociolgy' - closest match found is 'sociology'
  GREEN KEY to use 'sociology' instead | BLUE KEY to type a different word

Any word in the candidate list has the same initial letter as the user's word and a similar consonant structure, so we guessed that it might be unnecessary to take any further account of the order of the letters. We use a simple "anagram" technique and a word-length criterion. The minimum acceptable score is relatively higher for rather short words than it is for longer ones. The matching algorithm is given in Appendix 2.

Example: the user's word *APLIANCE encodes to ABLMC. Lookup of ABLMC returns a list consisting of

  AFFLUENCE  which scores 4  (4 letters apart from the first letter in common with the user's word)
  APPEALING  which scores 6
  APPLIANCE  which scores 7
  APPLYING   which scores 4

APPLIANCE has the highest score and length within one letter of the user's word, so this is suggested as a replacement.
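The scores in the example are consistent with a simple "bag of letters" overlap, counted after the shared initial letter. The real criterion is in Appendix 2, so the following is a reconstruction under that assumption:

```python
from collections import Counter

def anagram_score(candidate, users_word):
    # Count the letters the two words have in common when each is treated
    # as an unordered bag of letters, ignoring the first letter (which is
    # already known to match).
    a = Counter(candidate.lower()[1:])
    b = Counter(users_word.lower()[1:])
    return sum((a & b).values())   # multiset intersection size
```

Against the user's word *APLIANCE this gives AFFLUENCE 4, APPEALING 6, APPLIANCE 7 and APPLYING 4, matching the worked scores; the word-length criterion would then confirm APPLIANCE as the suggestion.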
(Note that APPEALING would have been offered if APPLIANCE had not been in the dictionary. This illustrates the importance of having a dictionary of adequate size.)

6.4.4 Discussion of the spelling correction technique

USER INTERACTION

See Chapter 7, Figs 7.8 and 7.6 for the screen layouts. Words whose stems are not found in the catalogue's index fall into (at least) three categories. They may be correct or incorrect, and, if incorrect, the corrected word may or may not be in the catalogue. Hence the procedure is not offered to the user as spelling correction, but rather as the neutral 'CAN'T FIND "A" - closest match found is "B"'. It would sometimes be presumptuous to say 'CAN'T FIND "A" - do you mean "B"?' and out of the question (with our methods) ever to make an automatic substitution. One of our objects was to avoid having to process searches which contain any words which are not found. Ignoring the word can result in the retrieval of rubbish. Implicit AND systems will simply return a failed search. In ranked output systems such as ours it is impossible to know what importance to attach to a "missing" word. Anyway, a majority of such words are misspellings or miskeyings. It is essential that the user should know that a word is not found. Hence the system forces the user either to replace a missing word or to tell the computer to ignore it.

WHY NOT OFFER MORE THAN ONE WORD?

If more than one word from the candidate list scores highly in the matching procedure it would seem desirable that they should all be offered. In an earlier version of Okapi we experimented with a similar procedure for personal names (which is what the original Soundex scheme was intended for). That version would offer up to nine possible matches, arrayed neatly on the screen for selection by keying a single digit.
However, we assume that whereas a user may genuinely not know how to spell a personal name, a large proportion of erroneous words are due to miskeyings rather than misconceptions about the spelling. Indeed, Okapi '84 logs show that personal names are more likely to be right than other words. A small sample of erroneous words from subject searches suggests that about 90% look like miskeyings rather than misspellings. Simplicity (for the user) outweighed other considerations, so we decided to offer at most one suggestion for replacement. Where more than one dictionary entry gives a high matching score, it would be sensible to offer the most frequent (as being a priori more likely to be searched for), but we did not have time to implement this. Further, there is usually a single clear winner, and it is nasty computerese to offer a "choice" from a set of size one. The alternative (selection by digits if more than one, our existing layout if only one) is worth trying, but might easily "throw" a user who has become used to using the green key to select the computer's suggestion and is now faced for the first time with the multiple choice screen. We carefully monitored all occurrences of the automatic correction during the first week of live use, before data collection had started. In an appreciable proportion of occurrences a user sat for a long time staring at the screen and/or pressed the red or black keys to abort the search even though the suggested correction was good. This reinforced our feeling that it would be unwise to offer a choice of corrections.

6.5 Search processing and term combination

Between the user's entry of a search and the system's display of the result, the following steps are carried out:

1 The user's input is preprocessed. The system displays

     Your search 'electrical safety standards'
     Looking up these words

2 Lookup of weak and strong stems in the index, referring back to the user if any are not found.
The system displays each word with the number of books (if any) indexed under the word's weak stem. Several other messages are possible, the most frequent being

     CAN'T FIND '(word)'

When everything has been looked up, including strong stems if this is the EXP catalogue, the system decides which of the items are going to be used in the merge (below).

3 Assignment of weights to terms.

4 Calculation of "good" and "acceptable" weights for record retrieval.

5 "Merging" of the posting lists for the terms to find records which reach an acceptable weight.

6.5.1 Preprocessing and index lookup

The user's search statement is disassembled into words, each word is weak-stemmed, and the statement reassembled. Each component is looked up in the weak stem index unless it is in the stop list. What constitutes a component is actually determined by the index lookup - in the EXP system this may be a word stem or a phrase from the go/see list; in CTL it is always a stem; in OSTEM it is always a "raw" word. Any component which is not found, or which is found but has no postings (the latter case can only occur in EXP with the few go/see terms which do not occur in the source file), is negotiated with the user, who must replace it, tell the computer to ignore it, or terminate the search. In the EXP system each weak stem which is not in the go/see list is also strong-stemmed, and the strong stem is looked up in the index. Each strong stem is marked as being semantically equivalent to its corresponding weak stem. At this stage the system has a list of components with a (non-zero) number of postings for each component. It knows whether an item is a strong or a weak stem, and, if it is a strong stem, which if any of the weak stems it is equivalent to. No component is included more than once, and no account is taken of word order - except in the case of phrases in the go/see list.
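The information held for each component at this stage might be represented as follows. This is purely illustrative: the names are invented, and the actual data structures (in Z80 assembly language) are certainly different.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Component:
    # One looked-up search component, as held at the end of 6.5.1.
    stem: str                   # weak or strong stem, or a go/see phrase
    postings: int               # non-zero number of postings in the index
    strong: bool                # is this a strong rather than a weak stem?
    equivalent_to: List[int] = field(default_factory=list)
    # numbers of the weak-stem components this strong stem is marked
    # semantically equivalent to (empty for weak stems)
```

A strong stem shared by two of the user's words would then carry two equivalences, as ELECTR does in the example below.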
Example: ELECTRICAL SAFETY STANDARDS FOR ELECTRIC FIRES produces the weak stems

  1 ELECTRICAL (573 postings)
  2 SAFETI     (262 postings)
  3 STANDARD   (565 postings) (1)
  4 ELECTRIC   (421 postings)
  5 FIRE       (141 postings)

and the strong stems

  6 ELECTR    (888 postings)      equivalent to 1 and 4
  7 SAFETI    (262 postings)      equivalent to 2
  8 STANDARD  (579 postings) (1)  equivalent to 3
  9 FIRE      (141 postings)      equivalent to 5

(1) The weak stem STANDARD arises from STANDARD and STANDARDS, but the strong stem also includes STANDARDISATION.

The purpose of the equivalences is to prevent a concept contributing twice to the weight of a retrieved record. We don't want the search in the example to retrieve a record MARINE SAFETY STANDARDS AND STANDARDIZATION, as it probably would if both the strong and weak stems "STANDARD" contributed to its weight. On the other hand, if the user has entered two morphologically similar words (ELECTRICAL and ELECTRIC in the example) we do not count these as equivalent, although they have the same strong (but not weak) stems. We did give some consideration to marking any two terms with the same strong stem as equivalent, but decided that this would introduce too much fuzziness: in the search COMMUNICATION IN COMMUNIST SOCIETIES, COMMUNICATION and COMMUNISM would be treated as equivalent because their strong stems are the same. Items indexed under all three terms would not be ranked any higher than items indexed under only two.

6.5.2 Assignment of term weights

Each component is given a weight which is the largest integer not greater than log(N/n), where n is the number of postings for the component.
N is a constant which is related to the number of records in the bibliographic file. It must be at least as great as the number of postings for the commonest term in the index. The logarithm is taken to base 2 (theoretically it doesn't matter what base is used, provided the weights are stored with enough precision, but since we store them as one-byte integers, base 2 gives a reasonable spread of weights with minimal arithmetic).

Example: For the systems used in this project with the current PCL catalogue the weight constant N is set to 32768 (this being comfortably more than the number of postings for the commonest term in the index). Since 2^15 = 32768, log(32768) is 15, and the weight of a term with n postings is 15 - log(n) (rounded down if necessary). Thus the weights are

  term              pstgs  weight
  1 ELECTRICAL        573  15 - 9 = 6
  2 SAFETI            262  15 - 8 = 7
  3 STANDARD          565  6
  4 ELECTRIC          421  7
  5 FIRE              141  8

and for the strong stems

  6 ELECTR            888  6
  7 SAFETI            262  7
  8 STANDARD          579  6
  9 FIRE              141  8

For a theoretical basis for this weighting scheme, see [11]. It is said to give the best probabilistic approach to ranked output in the absence of any relevance information (or in the absence of any knowledge about the relative importance of the terms). A term which occurred in every record would be useless for discriminating between relevant and non-relevant records, so this should have zero or very low weight. Our weight function satisfies this criterion (if N is set to the number of records in the file). It is not always true that rare terms are more important than common ones, but at least if records containing rare terms are presented first it doesn't take too long to look at them and get them out of the way. An example of a type of search where the rarest term is by no means the most important is LEAST SQUARES ESTIMATORS. "Estimators" has only two postings in the PCL file, although its strong stem has several hundred.
The problem with this search is that it comprises a single concept, and so should be treated as a phrase, but the same concept could be, and is, expressed in relevant records as LEAST SQUARES METHOD or just LEAST SQUARES.

6.5.3 Calculation of "good" and "acceptable" weights for record retrieval

During the merge (see below), the weight of a record is determined by the weights of the terms which it has in common with the search. The minimum acceptable weight (MAW) is the threshold weight for a record, below which it will not be retrieved (although it must be indexed under at least one of the terms of the search, otherwise it would not be considered at all). The minimum good weight (MGW) is the least weight at which a record will be considered as a reasonable match with the user's search. MAW and MGW are calculated using the maximum possible weight (MPW). MPW is the weight which a record would have if it contained all the terms of the search. It is the sum of the weights of all the weak stems in the search. In the example above these add up to 34. The method used to calculate MAW and MGW depends on the number of terms in the search. Searches are treated differently depending on whether they contain one term, two terms or more than two terms. When there are only one or two terms the actual function used for the weighting is almost irrelevant, provided that strong stems have lower weight than weak stems.

SINGLE TERM SEARCHES

For CTL, MAW = MGW = MPW = the weight of the (single) weak stem. For EXP, MGW as CTL; MAW = weight of the strong stem. A large proportion of single term searches are for proper names, where the strong and weak stems are generally identical.

Example: INTEGRALS (EXP system) retrieves 46 books under INTEGRAL(S), followed, if the user wishes, by another 230 under INTEGRATION, INTEGRATING, INTEGRATED and other words which give the strong stem INTEGR.
TWO-TERM SEARCHES

This is the most frequent number of terms in a search. In the absence of semantic knowledge, it is only the notions "common" and "rare", together with the ability to differentiate between strong and weak stems, which are needed to rank the records for output.

TWO COMMON TERMS

Only records containing both terms (or their strong stems) will be retrieved. MAW = sum of the weights of the strong stems. MGW = sum of the weights of the weak stems. (In CTL, MGW = MAW.)

Example: INDUSTRIAL SOCIETY

ONE COMMON TERM, ONE RARE

All records containing (the strong stem of) the rare term are retrieved. MAW = weight of the strong stem (EXP) or weak stem (CTL) of the rare term. MGW = sum of weights of strong stems (EXP) or weak stems (CTL).

Example: HISTORY OF SWORDS

TWO RARE TERMS

All records containing (the strong stem of) either term are retrieved. MAW = weight of the strong (resp. weak) stem of the commoner term. MGW = sum of weights of the strong (resp. weak) stems.

Example: YACHTING AND BOATING

MORE THAN TWO TERMS

Records will usually be retrieved if they contain (stems of) about two-thirds of the terms in the search. MAW = half the maximum possible weight; MGW = two-thirds of the maximum possible weight.

Example: SOCIAL STRATIFICATION AND OCCUPATIONS

The frequencies and weights are (strong stem weights not given)

  term              pstgs  weight
  1 SOCIAL           6257  15 - 12 = 3
  2 STRATIFICATION     46  15 -  5 = 10
  3 OCCUPATIONS       100  15 -  6 = 9

  MPW = 3 + 10 + 9 = 22
  MAW = 11  (half of 22)
  MGW = 14  (two-thirds of 22)

Thus all records containing any two of the three terms will be retrieved. The strong stem weights are not given here, but in this example the retrieved set would probably include records indexed under any two of the strong stems alone.
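The more-than-two-terms rule can be sketched directly. The one- and two-term rules depend on the common/rare distinction, whose exact threshold is not given here, so only the general case is shown; the rounding to integers is an assumption consistent with the worked example.

```python
def thresholds(weak_stem_weights):
    # More-than-two-term searches:
    #   MPW = sum of the weak stem weights (all terms present)
    #   MAW = half of MPW  (minimum acceptable weight)
    #   MGW = two-thirds of MPW  (minimum good weight)
    mpw = sum(weak_stem_weights)
    maw = mpw // 2
    mgw = (2 * mpw) // 3
    return mpw, maw, mgw
```

For SOCIAL STRATIFICATION AND OCCUPATIONS, with weights 3, 10 and 9, this gives MPW = 22, MAW = 11 and MGW = 14, as in the example.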
In the PCL file there are two records indexed under all three words, 41 more under SOCIAL and STRATIFICATION, and a further 13 under SOCIAL and OCCUPATIONS. (Both the records under STRATIFICATION and OCCUPATIONS also contain SOCIAL.) When the merge (see below) is complete the user will see

  2 books match your search exactly
  (56 books found altogether)

6.5.4 Merging the posting lists

The postings lists in the index are ordered. They can be thought of as being in document number order. In the present implementation "document numbers" are really disk addresses. Whatever they are, they must be ordered, otherwise the merge would be too inefficient, and they must be able to tell the system where to fetch the records from for display to the user. Up to 16 postings lists are merged into a single output list. A list represents a single weak or strong stem. The lists are numbered 1 to 16. Each list has a weight attached to it, and possibly a list of the numbers of other lists to which it is semantically equivalent. While there is some list which is not finished, the merge finds the "smallest" posting not yet considered. It sums the weights of each list in which this posting occurs, omitting weights for those lists which are marked as equivalent to another list which contains this posting. If the total weight for the posting is at least MAW (minimum acceptable weight), the posting is copied to the output list.
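A simplified sketch of the merge follows. The real implementation streams up to 16 ordered posting lists and caps the output at 512 entries; this version works on in-memory lists to show only the equivalence handling and the MAW test, and all names are illustrative.

```python
def merge(posting_lists, weights, equivalents, maw):
    """posting_lists[i] -- ordered postings (document numbers) of term i
    weights[i]          -- weight attached to list i
    equivalents[i]      -- set of list numbers list i is equivalent to
                           (e.g. a strong stem pointing at its weak stem)
    """
    # Record, for every document, which lists post it.
    lists_for = {}
    for i, plist in enumerate(posting_lists):
        for doc in plist:
            lists_for.setdefault(doc, set()).add(i)
    out = []
    for doc in sorted(lists_for):            # "smallest" posting first
        total = 0
        for i in lists_for[doc]:
            # Omit this list's weight if it is marked equivalent to another
            # list which also contains the posting, so a concept cannot
            # contribute twice to a record's weight.
            if any(j in lists_for[doc] for j in equivalents[i]):
                continue
            total += weights[i]
        if total >= maw:                     # threshold test
            out.append((total, doc))
    out.sort(key=lambda p: (-p[0], p[1]))    # highest weight first
    return out
```

As an illustration: merging a weak stem (weight 6, postings [1, 2]) with its strong stem (weight 6, postings [1, 2, 3], marked equivalent to the weak stem), with MAW = 6, each of documents 1, 2 and 3 scores 6 exactly once rather than documents 1 and 2 scoring 12.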
Thus the merge ends with a list containing the addresses of all the records which contain enough of the components of the search to reach MAW. In our implementation the output list is restricted to 512 postings, but all postings in the input lists are considered: if the output list becomes full, new postings replace postings already in the output list if their weight exceeds that of the posting with the lowest weight in the existing output list. This makes the process more complicated than it need be, but we were working with computers with a very limited amount of core memory. It is not a constraint to the user: 512 records is comfortably above what most users wish to see, and the list is guaranteed to contain the "best" records. As soon as the merge is finished, the output list is sorted by weight so that the records with the highest weight will appear first.

6.6 The bibliographic file

This is very similar to the one used for Okapi '84, which is fully described in Chapter 4 and Appendixes 1 to 4 of the first Okapi report [12]. To try to add a bit more subject information the present file additionally includes MARC 651, and subfields $x (subject or form subdivision) and $z (place subdivision) of 650 and 651. We also intended to use 505 (contents notes), which had only been used for records with analytical entries in the previous file, but this is so rarely used in the PCL file that it was not worth the overhead of an empty field in nearly every record. Nevertheless, by British standards the PCL records are comparatively rich in subject content, as a large proportion of them contain verbal feature headings (MARC 083), PRECIS headings and LCSH. Inevitably, there is a good deal of duplication of headings, but most records look reasonable when displayed. The extra subject information increased the average record length considerably. To compensate for this we no longer store copy numbers, but only a count of the number of copies at each site.
There are about 98,000 records in the file.

6.7 The subject index

Index structure and storage are very similar to those described in 5.7 of [12]. Since there are only four data types in the index (weak stems, strong stems, entire and truncated Dewey numbers) the structure has been simplified a little.

6.7.1 Indexing and the go/see list

The go/see list is used during indexing. When index terms are being extracted from bibliographic records, the field being indexed is matched against the list before the normal process of word extraction is performed. If it contains a phrase or word from the list, a token representing this go/see entry is output. For example, "United States" produces the same token as "USA". When the final index is being produced the actual entries (e.g. "United States") are read in at the right place in the alphabetic sequence, and pointed at the list of postings for the appropriate token. The end result is that searches for "United States" and for "USA" return the same posting list. Thus the go/see list is no longer explicitly used after the file has been indexed; it is, in effect, incorporated in the index. (The system does not "know" at search time which terms are included in an equivalence class. But it does know when it has found something which is in the go/see list. It cannot inform the user that "UNITED STATES" includes "USA" unless both these terms occur in the same search (7.4.2).)

6.7.2 Source fields

The index is generated from all title-like fields, subject headings and verbal feature headings, corporate and conference names (both author and subject), and the go/see list.

6.7.3 Index contents and size

The index used by EXP and CTL contains weak and strong stems of every word in the source fields.
Hyphenated words contribute both "concatenated pairs" and their separate constituents ("non-proliferation" gives rise to "non", "proliferation" and "nonproliferation", as well as their strong stems "prolif" and "nonprolif"). "Initialisms" are processed so that they become words ("USA" = "U.S.A." = "U S A"). Every entry from the go/see list, weak stemmed, is in the index (6.7.1). The index also contains Dewey numbers and truncated Dewey numbers, but these were not used in the experiments described in this report. Without Dewey numbers, the mean number of index terms per bibliographic record is about 24. There are about 61,500 distinct stems. For the majority of words, the strong stem is the same as the weak stem. When this is the case they are not stored separately, but merely flagged as being both strong and weak stems.

6.8 Storage requirements

  Bibliographic file:                20 megabytes
  Stem index (used by EXP and CTL):  5.7 megabytes (excluding Dewey numbers, which occupy about another megabyte)
  Word index (used by OSTEM):        4.4 megabytes
  Spelling dictionary and its soundex code index: 0.6 megabytes

6.9 The Okapi '86 programs

It is often true that the more simple a system appears to the user, the more complex it needs to be "behind the screen". We would not have been able to undertake this research if we had not had the Okapi '84 system to build on. Several person-years of design and programming work had gone into this. This meant that, for this project, we did not have to spend much time on file structure and index lookup, or on record displays. The programs which deal with the formatting and sequencing of record displays consist of some 2000 lines of code; this is largely unchanged from the way Gill Venner wrote it three years ago. However, a very considerable amount of new design and programming had to be done. Like previous Okapi systems, most of the programs are written in Z80 assembly language.
The programs for reading MARC tapes and for selecting and stripping the records are written in COBOL and DEC assembly language. There seems to be a tendency for the complexity of programs to vary inversely with the outward simplicity and openness of the system. Why this is so is illustrated by the trivial example of a program which allows a user to enter dates in free formats, such as 1/3/87, 1st March 1987, 1.iii.87. Such a program is about an order of magnitude more complicated than one which rejects a date if it is not entered in some "standard" form like 01-03-1987. Until the advent of computing for the general public it was not usually worth while to write programs like this. It would be out of place here to give a detailed description of the internal workings of Okapi '86, but it does suffer from the type of interactional complexity illustrated above. An example is the program which takes control while the user's search is being parsed and its terms looked up. This program has to handle the various combinations of messages which can appear on the "searching" screen (Fig 7.5 etc.) while it is working, and maintains links between what the user typed and the stems which are being looked up. It appears very simple, yet it contains about 1700 lines of code at the top level (and several times as much again at lower levels which deal with index lookup etc.). It is controlled by three decision tables, the largest of which has to check five conditions (is this a phrase or a word, did it come from the go/see list, has it any postings, have we had it before in this search, and, if it has occurred before, did it arise from the same word or words in the user's search statement?); depending on which of the conditions are true it has to perform various combinations of seven actions (store the result of this term search, display "N books under ...", display "... included under ...", perform strong stemming etc.).
For each term in the search there are, at the top level, about 40 different paths which the program can follow. Even with this degree of complexity there are "loose ends" (surprisingly, we have not come across any actual mistakes); a few rare combinations of conditions are not properly dealt with.

References

1 ULMSCHNEIDER J E and DOSZKOCS T E. A practical stemming algorithm for online search assistance. Online Review 7 (4), August 1983, 301-315.

2 MISCHO W. Library of Congress Subject Headings: a review of the problems, and prospects for improved subject access. Cataloging & Classification Quarterly 1 (2/3), 1982, 105-124.

3 SCHABAS A H. Postcoordinate retrieval: a comparison of two indexing languages. Journal of the American Society for Information Science 33 (1), 1982, 32-37.

4 MITEV N N and WALKER S. Intelligent retrieval aids in an online public access catalogue: automatic intelligent search sequencing. In: Informatics 8: advances in intelligent retrieval. Proceedings of an Aslib/BCS conference, Oxford, 16-17 April 1985. London: Aslib, 1985.

5 PAICE C D. Information retrieval and the computer. London: MacDonald and Jane's, 1977.

6 PORTER M F. An algorithm for suffix stripping. Program 14 (3), 1980, 130-137.

7 LENNON M and others. An evaluation of some conflation algorithms for information retrieval. Journal of Information Science 3, 1981, 177-183.

8 FRAKES W B. Term conflation for information retrieval. In: Research and development in information retrieval: proceedings of the third joint BCS and ACM symposium, King's College, Cambridge, 2-6 July 1984. Edited by C J van Rijsbergen. Cambridge University Press on behalf of the British Computer Society, 1985, 383-389.

9 TAGLIACOZZO R, KOCHEN M and ROSENBERG L. Orthographic error patterns of author names in catalog searches. Journal of Library Automation 3 (2), June 1970, 93-101.

10 POLLOCK J J and ZAMORA A.
Collection and characterization of spelling errors in scientific and scholarly text. Journal of the American Society for Information Science 34 (1), January 1983, 51-58.

11 CROFT W B and HARPER D J. Using probabilistic models of document retrieval without relevance information. Journal of Documentation 35 (4), December 1979, 285-295.

12 MITEV N N, VENNER G M and WALKER S. Designing an online public access catalogue: Okapi, a catalogue on a local area network. (Library and Information Research Report 39). London: British Library, 1985.