1 Introduction and background

1.1 The project proposal

The research proposal "Improving access in an online public access catalogue by automatic word stemming, spelling correction and limited synonym generation" was approved by the British Library Research and Development Department (BLRDD) and funded under grant SI/G/720. It proposed to investigate the following recall-improvement devices:

    automatic word stemming
    synonym and cross reference tables
    Soundex-type keys for matching personal names
    an n-gram technique for approximate matching of words

These were to be applied within a version of the Okapi online catalogue developed under an earlier project.

These devices are not, of course, new. They all have a long history, which is summarised in Chapters 3 - 5. Some of our techniques are probably new, but the main emphasis in the design of the experimental systems was on ways of applying and presenting the devices in an online catalogue for general users.

1.2 Motivation

An online catalogue is a bibliographic reference retrieval system. It differs from the "traditional" reference retrieval systems such as Lockheed's DIALOG in several ways:

    searches are done by untrained end-users rather than by or with the help of intermediaries
    the subject coverage is often wide
    the subject description in the records is inadequate or even absent.

In traditional information retrieval (IR) intermediaries use truncation and other "fuzzy matching". Inexperienced and casual users will not generally do this, so a system designed for such users should automatically carry out some of these procedures.

There are many cases where a search intermediary would expand a searcher's query to include terms which cannot be obtained by a simple application of grammatical rules and/or recourse to lookup tables.
The skills required to do this involve the use of subject knowledge, linguistic knowledge and knowledge of the file(s) being searched, and cannot be automated within the present capabilities of linguistic computing. However, the automatic inclusion of morphologically related search terms and some tolerance of misspellings should be quite feasible with conventional hardware and software techniques.

1.2.1 Feasibility study

The likely utility of automatic query expansion had been investigated by repeating subject searches from the transaction logs of the original Okapi system (see 1.4). More than a quarter of the searches would have benefitted from a simple stemming procedure (conflating singular and plural noun forms, "ing", "ed" and "s" verbal endings etc.). Occasionally, there would have been a decrease in precision. About a tenth of the searches contained apparently unnoticed spelling mistakes.

1.3 Staffing

The project head was Neil McLean, Head of Library Services at the Polytechnic of Central London (PCL). The proposer and director of the research was Stephen Walker. Richard Jones was appointed as Research Officer in September 1985, and Nicola Johns as Research Officer/Programmer in June 1986.

1.4 Environment

The project aimed to build on the work of Mitev, Venner and Walker, who designed and developed the experimental online catalogue system, Okapi, under the BLRDD and Department of Trade and Industry funded project "Microcomputer networking in libraries". The work of this project, proposed by Neil McLean and Mel Collier, is reported in [1]. Other publications include [2,3,4,5,6,7].

The original Okapi system will be referred to here as Okapi '84. It is described in [1]. Okapi '84 was an implementation of an online catalogue search system on a local area network (Nestar PLAN
4000) of Apple IIe microcomputers. It accessed a file derived from most of the monograph records in PCL's machine-readable catalogue. The emphasis in Okapi '84 was on providing easy interaction - usability at sight - combined with an effectiveness at least comparable with other general systems. Okapi '84 was installed in the Riding House Street library, the largest of the Polytechnic's site libraries, where up to four terminals were in daily use for nearly [illegible]. The same hardware and the same test site were used for the current project.

The bibliographic file used for the current project is very similar in structure to that described in Chapter 4 of [1]. The only substantial difference is that Library of Congress Subject Heading (LCSH) form, content and geographical subdivisions are included.

Several generations of search and indexing software were constructed during the project. Much of the Okapi '84 search system was completely redesigned and rewritten, although all the programs rely heavily on the substantial library of utility subroutines which was written by Stephen Walker and Bill [surname illegible] for the earlier project. Of the higher level code, the record formatting and display routines remain largely unchanged.

1.5 Historical summary of the project

Because of shortage of time on the earlier project, little formal evaluation of Okapi '84 had been done. Transaction log data was collected continuously from Okapi '84 until May 1986. Much of the early work on the current project involved examination and analysis of the log data in order to form a pragmatic basis for any new search system.

We wrote several versions of a Soundex-type procedure, both for personal names and for textwords, and made informal comparisons of their efficacy. The procedure eventually chosen is documented in 5.4.2 and Appendix 1. A slightly modified implementation of Martin Porter's suffix-stripping procedure
([section reference illegible]) was tested. Both of these devices were incorporated in an "intermediate" Okapi system which was demonstrated at the conference "Online Public Access to Library Files" organised by the Centre for Catalogue Research at Bath University in April 1986.

The intermediate system contained a simplified and probably improved specific item search function. Author and title index displays, which were provided under some circumstances in Okapi '84 but were rarely used, were replaced by sequenced displays of brief bibliographic records. The precision of the implicit author/title acronym search function was improved by validating the resulting set by combining it with a word from the title or the author as entered by the user. There was a Soundex-type index of personal surnames, which could lead to the display of a selection of possible matches in the case of a failed personal name look-up.

The intermediate system was never developed to the point where it was robust and finished enough for public use (although it stood up fairly well to the attention of librarians at Bath). Its specific item search will probably be used for future public versions of Okapi, and the subject search forms the basis of the systems described in this report. The intermediate system is described in [7].

The unfinished subject search facility of the intermediate system included some of the features used in the systems described in Chapters 6 and 7. The subject index consisted of words from titles, subtitles, subject headings and other subject-rich fields of the MARC record, and also Dewey numbers. All words were subjected to the Porter algorithm, and were combined using a combinatorial (AND/OR) technique with inverse term frequency weighting as described in 6.5. Records were output in decreasing weight order.
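
The general idea of a weighted search with ranked output can be illustrated with a minimal sketch. It is not the Okapi code: the weight used here, log(N/n) + 1 for a term occurring in n of N records, is a common stand-in and is assumed rather than taken from 6.5; the function names and the sample records are hypothetical; and the Porter stemming step is left out for brevity. A record matching any query term is retrieved (OR), while records matching more and rarer terms come out first, so that records matching every term (AND) head the list.

```python
import math
from collections import defaultdict


def build_index(records):
    """Map each term to the set of record ids whose subject text contains it."""
    index = defaultdict(set)
    for rec_id, text in records.items():
        for term in text.lower().split():
            index[term].add(rec_id)
    return index


def ranked_search(query, index, n_records):
    """Score each record by the summed weights of the query terms it contains."""
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        postings = index.get(term, set())
        if not postings:
            continue  # a term absent from the index contributes nothing
        # Assumed weight: terms occurring in fewer records count for more.
        weight = math.log(n_records / len(postings)) + 1.0
        for rec_id in postings:
            scores[rec_id] += weight
    # Decreasing weight order; ties are broken by record id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data, for illustration only.
RECORDS = {
    1: "child psychology",
    2: "child development and education",
    3: "history of education",
    4: "educational psychology of the child",
}
INDEX = build_index(RECORDS)
RESULTS = ranked_search("child psychology", INDEX, len(RECORDS))
```

With this sample data, records 1 and 4 contain both query terms and are ranked ahead of record 2, which contains only the commoner term; record 3 is not retrieved at all.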
There was a small cross-reference list enabling terms to be treated as synonymous: the list included CHILD and CHILDREN and some other irregular plurals and alternative spellings.

Richard Jones carried out a substantial analysis of failed searches, some of which was published in [8]. We constructed an inverted file of some 6000 subject searches submitted to Okapi '84 and used it to obtain a collection of spelling mistakes and to suggest entries for an automatic cross-reference table (Chapter 6 and Appendix 5).

During the Summer and Autumn of 1986 we designed and wrote new subject search and indexing programs. Two versions of the search program were produced. One - the "experimental" system (referred to as EXP) - incorporates full two-level stemming, a substantial look-up table of phrases and equivalence classes of related terms and a spelling correction procedure. The other program - the "control" or CTL system - has "weak" stemming but no look-up table and no spelling correction. It is unlikely that users would notice any difference between the two programs unless a search carried out on one was immediately repeated on the other. There is also a third system, called OSTEM, which incorporates none of the recall-improvement devices. This was only used for the repetition of users' searches by the experimenters.

In November Okapi '86 (subject search only) was installed at two terminals in the Polytechnic's Riding House Street library. The systems were alternated daily between the two terminals. After a trial period, during which the system was slightly improved, we began to collect data from live use. About 120 people were interviewed after they had performed subject searches, and full transaction log data was gathered from about 1100 searches (about 600 sessions). After formal data collection had finished, the EXP system was left on site to collect further transaction log data.
At the time of writing we have collected logs of some 7700 searches, and have done a certain amount of analysis of this data where necessary to supplement the original set.

In parallel with the work outlined above, we were also designing and developing programs for use in a second project (on the use of relevance feedback in online catalogues - SI/G/765). While data analysis was being carried out, Nicky Johns resigned. Since it would not have been possible to obtain a suitable replacement on what would have been a very short contract, we suspended work on the relevance feedback project so as to be able to complete the present project as quickly as possible.

1.6 The report

Chapter 2 gives a brief survey of some of the problems in subject searching in online catalogues. Chapters 3, 4 and 5 survey ways of implementing the recall-improvement devices in IR systems in general as well as in catalogues. The catalogues which we designed for this project are described in Chapters 6 and 7. Chapter 6 describes them from the inside and Chapter 7 from the outside - how the user sees them. These two chapters are interdependent. Their logical order may be 6 followed by 7, but their psychological order may be the reverse. Chapters 8 and 9 cover data collection, evaluation, results and conclusions.

References

1 MITEV N N, VENNER G M and WALKER S. Designing an online public access catalogue: Okapi, a catalogue on a local area network (Library and Information Research Report 39). London: the British Library, 1985.

2 MITEV N N and WALKER S. Intelligent retrieval aids in an online public access catalogue: automatic intelligent search sequencing. In: Informatics 8: Advances in Intelligent Retrieval. Proceedings of an Aslib/BCS Conference, Oxford, 16-17 April 1985. London: Aslib, 1985.

3 WALKER S. The free language approach to online catalogues. In: Keyword Catalogues and the Free Language Approach. Edited by Philip Bryant. University of Bath Library.

4 MITEV N N, VENNER G and WALKER S. OKAPI: an online public access catalogue on a microcomputer local area network. In: Online Public Access to Library Files. Proceedings of a Centre for Catalogue Research Conference held at Bath University 3-5 September 1984. Edited by Janet Kinsella. Oxford: Elsevier, 1985.

5 MITEV N N. The user interface in an online public access catalogue. In: [title partly illegible] RIAO 85. International Symposium organised by the Centre de Hautes Etudes Internationales d'Informatique Documentaire and the Institut d'Informatique et de Mathématiques Appliquées de Grenoble, 18-20 March 1985, Grenoble, France. Paris: [publisher illegible].

6 VENNER G, WALKER S and MITEV N. Okapi: a prototype online catalogue. Vine 59, July 1985, 3-13.

7 WALKER S. OKAPI: evaluating and enhancing an experimental online catalogue. Library Trends, Spring 1987 (to be published).

8 JONES R. Improving Okapi: transaction log analysis of failed searches in an online catalogue. Vine 62, May