2 Subject searching problems

2.1 Introduction

In most online catalogues an unacceptable proportion of subject searches fail to find anything at all. Markey [1, p83] gives proportions ranging from 35% to 57% taken from studies of four catalogues in the United States. One of the Final Reports arising from the CLR (Council on Library Resources) study [2, p13] recommended that users' subject search terms should be automatically truncated if there were no retrievals. The same report recommends that spelling correction should be applied in known-item searching. This applies at least equally to subject searching (Okapi transaction logs suggest that spelling and keying mistakes are more serious in subject searching than in known-item). The same CLR report also urges the provision of "cross-references online" and "related word lists to lead users to more general term(s)".

The main problem in online access is that of matching the user's search to the various ways in which the sought objects may be described in the database. First, many searches contain misspellings or miskeyings. Since this is often unnoticed by the user, who assumes that the sought subject is not covered, the retrieval system should help the user to correct misspellings. Then, even when all the words of the search are correct they may not correspond to the language of the catalogue. Searches can be broadened by automatic stemming of their constituent words and by automatic cross-referencing. It is these three devices - spelling correction, stemming and the use of cross-reference tables - which are the subjects of this report.

Additionally, in keyword-type catalogues, many searches fail to find anything because not all of the words of the search co-occur in any record, even after they have been corrected and stemmed and cross-references have been drawn in. This does not mean that there are no relevant records.
Often it is due to the inclusion in a search of "inappropriate" words such as expressions of time (WOMEN'S WORK BETWEEN THE WARS) or of scope (CRITICISMS OF FRANK PARKIN). Some, but by no means all, of these terminological problems can be alleviated by using a fairly large stop list. CRITICISM cannot be stopped; BETWEEN, UNTIL etc probably can be. (Whether dates should be stopped is an interesting question which there is no room to discuss here.) What is clear is that online catalogue subject search systems must be able to retrieve records which contain only some of the unstopped words of a search. We refer to techniques for achieving this under the general name "combinatorial searching". An evaluation of combinatorial searching was not one of the objects of the present work - we regard it as an essential feature of any keyword search system for untrained users. Nevertheless there are many references to it in this report because our application of some of the other devices is closely connected with Okapi's method of term combination.

2.2 What users bring to the catalogue

The majority of users' subject search statements as recorded in system logs are straightforward and comprehensible. Most of them are concise noun phrases - many are simple one-word concepts. Sometimes they contain synonyms or related terms which are intended to be treated as alternatives. There are quite a lot of dubious spellings. Some searches are incomplete or incoherent, and there is always a proportion where the user is "fooling" or "playing" or simply not concentrating.

The searches listed in Table 2.1 were selected systematically from logs of Okapi '84 (every twentieth search starting at a randomly chosen page). They were all submitted as searches for "books about something", and in response to the prompt:

The computer will look for book(s) described by as many as possible of the word(s) you type.
Please enter word(s) or a short phrase which describe your subject:

It needs to be pointed out that only about half of them were the first search in a session. "Independent", for example, was part of the sequence "itn" [independent television news], "independent", "entertainment", "radio one". Readers of this report who have access to an online catalogue accessing an undergraduate level collection covering social studies, communication, economics and business studies are invited to repeat these searches, with or without the misspellings.

Table 2.1 A sample of subject searches

1 Popper
2 behavior teddyboys subculture
3 anything by Frank Parkin
4 raddio
5 tetnology influence on structure
6 drug abuse treatment of drug dependents
7 resource mineral depleation
8 central intelligence agency cia media press news propaganda tv radio
9 hollernzolleran [probably for Hohenzollern]
10 female sexuality
11 consumer decision making models
12 Rees case
13 independent
14 underdevelopment
15 machiavelli
16 modernism
17 Capital radio
18 photo;nature—nude
19 photography
20 education welfare
21 traumatalogy
22 nationalised industries
23 communalism patrimonalism
24 early development statistics movement

2.3 Subject search facilities in current online catalogues

2.3.1 Phrase searching

Some online catalogues offer subject access to an index of Library of Congress Subject Headings (LCSH). The first result of a search is a display of headings (not bibliographic records) in the alphabetical region of the user's input. The user can browse alphabetically backwards and forwards, and can select records indexed under a chosen heading. Some of these "phrase access" catalogues return a failed search if the user's key does not find at least a partial match with a subject heading, but most of them always display something.
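
The browsing behaviour described above can be sketched as a binary search into a sorted heading index. This is only an illustration, not the method of any particular catalogue: the heading list, the function name `browse_headings` and the window sizes are all assumptions made for the example.

```python
from bisect import bisect_left

# Hypothetical heading index, held in filing order (illustrative data only).
HEADINGS = sorted(
    [
        "Drug abuse",
        "Drug abuse--Treatment",
        "Education",
        "Photography",
        "Photography--Landscapes",
        "Radio broadcasting",
        "Sexuality",
    ],
    key=str.lower,
)


def browse_headings(key, headings=HEADINGS, before=1, after=3):
    """Return the headings in the alphabetical region of `key`.

    The window starts `before` headings ahead of the point where the key
    would file, so something is displayed even when nothing matches.
    """
    lowered = [h.lower() for h in headings]
    pos = bisect_left(lowered, key.lower())
    start = max(0, pos - before)
    return headings[start:pos + after]
```

With this data, `browse_headings("photo")` shows the region around "Photography", and a key that matches nothing still returns the nearest headings rather than a failed search.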
2.3.2 Problems with subject headings

Research [3, 4] has demonstrated the inadequacy of LCSH; many headings lack specificity and the language used is often out of date. At least half of all searches fail to locate either a heading or a reference at the first attempt. If subsequent attempts are included this figure can rise to about 70% [5]. The proportion of searches which exactly match a subject heading is usually very low (25% is typical), but not all of these searches fail. Of the 24 searches in Table 2.1, five are near enough to a Library of Congress heading in the PCL catalogue to find at least one book by browsing headings; two more are near to PRECIS headings.

Mandel and Herschman [6] suggest using feedback from user searches to incorporate more "see" references into the LCSH structure. This is certainly desirable, and it is the method we used for constructing the automatic cross-reference table used in the versions of Okapi described in this report. However, users should not have to repeat their searches using the "preferred" form of a heading. An online catalogue can and must do this automatically. If our users think of the Department of Education and Science as DES, then which is the "best" term? (Provided we can prevent French "des" entering the index.)

"See also" references are a different problem, and one which has not been seriously tackled yet. Some of the more recent commercially available systems do at least allow for their display and selection by line number.

The LCSH problem is only partly one of language. Several studies have demonstrated that lack of specificity in indexing often causes searches to fail. Mandel and Herschman [6] point out that this is not always the fault of the content of LCSH but rather the consequence of poor indexing.
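
The automatic use of "see" references argued for in this section can be sketched as a table lookup applied to each word of a search. This is a minimal illustration and not Okapi's actual cross-reference table or code: the table contents, the name `expand_search` and the choice to keep the user's own word alongside the preferred form are assumptions made for the example.

```python
# Hypothetical "see" table: non-preferred term -> preferred form.
SEE_REFERENCES = {
    "des": "department of education and science",
    "cia": "central intelligence agency",
    "itn": "independent television news",
}


def expand_search(search, table=SEE_REFERENCES):
    """Return the search words, each followed by its preferred form if any.

    The user's own word is kept as well, so the system does not have to
    decide which of the two is the "best" term.
    """
    expanded = []
    for word in search.lower().split():
        expanded.append(word)
        if word in table:
            expanded.append(table[word])
    return expanded
```

The sketch only covers the search side. Keeping French "des" out of the index, as the text notes, is a separate indexing decision that this example does not address.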
2.3.3 Keyword searching

Most of the more recent catalogue systems use individual words rather than headings; a few offer both access methods (but how does the user know which method to choose?). The words may be taken from subject headings or from titles or both.

Many European libraries do not use subject headings. Subject access, if any, has been provided by printed indexes to classification schedules. When such libraries automate their catalogues they sometimes provide "keyword-in-title" searching for subject access. It is likely that titles are a slightly richer source of subject-rich keywords than are subject headings, although users find subject headings useful for judging the likely relevance of a retrieved record. (Hence subject headings should be included in the bibliographic display.)

Thirteen of the searches in Table 2.1 (numbers 1, 2, 4, 6, 7, 10, 11, 14, 15, 16, 20, 22, 23) were repeated on the current project's LXP system. All find some books through (stems of) title words and eleven find some via subject heading words. Title was a richer source in eight of the searches and subject in two, three being judged equal. The two searches which didn't work at all on subject headings are "female sexuality" and "patrimonialism".

In almost all keyword systems, words are combined using an implicit boolean AND. Up to a half of the searches may fail in spite of the fact that individual words are far more likely to find something than phrases, because the words do not all co-occur in any record. Of the 24 searches in Table 2.1, 12 find something when their words are ANDed. Seven of these are single word searches.

2.3.4 Access points

Most MARC records are extremely poor in subject content. Marcia Bates has recently suggested [7] that LCSH would be adequate if clever linguistic and other preprocessing is applied to users' searches.
We do not believe that this is the whole answer. In the long term the emphasis in cataloguing must be moved from physical description to subject description. Until this happens it is essential to use all MARC fields which can contain subject information. Markey gives a list of subject-rich (US) MARC fields in [1, p158]. Titles tend to use language which is more current than that of subject headings and indexes, but they are also rich in metaphors and "noise-words" like "introduction". Series titles and corporate names are of some value, as are contents notes when used.

Markey and her team tried keyword access to Dewey indexes [8] and they found that, as with LCSH, the Dewey language is not rich in the sorts of words used by library users. (The use of the actual classification codes as a means of linking related records is one of the subjects of a related Okapi project. It is outside the scope of this report.)

2.3.5 Access method

Phrase matching systems depend on at least the first few characters of the user's input matching the first few characters of a relevant heading. Some online catalogues are undoubtedly less effective than card or microform catalogues because rapid browsing is difficult or impossible or because there is no cross-reference facility. The Geac catalogue at the Polytechnic of the South Bank offers subject access to a list of subject word descriptions (some descriptions are based on PRECIS, others are specific to the institution). Although some of the words are tagged for retrieval, other subject descriptions are only retrieved if the first characters are matched. An effectiveness study of this catalogue determined that while it was quite effective in specific item searches, only 34% of subject searches were successful [9, p64].

In some keyword systems difficulties are caused by the way in which keyword access is provided.
Some catalogues allow the entry of only one keyword; most allow more than one keyword to be entered but then only retrieve records which contain all of the users' terms.

2.4 How might subject access be improved?

2.4.1 Truncation and stemming

TRUNCATION

Truncation makes it possible for a user to retrieve morphologically related terms which may also have a semantic relationship. Many current catalogues offer some sort of truncation facility. Most phrase-matching catalogues will automatically retrieve headings which match the user's input but have additional characters on the right. Some of the keyword-type catalogues allow explicit truncation of words through the use of a special symbol or command.

One keyword catalogue, the OCLC LS/2000 system, does a kind of automatic truncation when the user chooses to search by "keyword". It displays a list of words which the user's word partially matches; the user has to select one of the words, whereupon the system responds with a list of up to 20 or so indexes (title, subject, series, author etc) in which the chosen word occurs, together with the number of titles pointed to in each index; finally, the user selects one of the indexes and can see some bibliographic records [10]. This may suggest that single-keyword searching via an index display is not always satisfactory.

It is unlikely that explicit truncation can be used without training. Markey [2] reports that CLR survey respondents found it difficult to use truncation. This is borne out by experience at the University of Hull, where the Geac system was modified to allow explicit truncation [11]. Surveys revealed that there was little use of the facility.

If truncation is to be used it must be automatic, as in systems which display headings which the user's search partially matches. But the example above shows that the automatic treatment of keywords as truncations is unacceptable. "Cat" must not retrieve records under "catamite" and "catastrophe". Automatic truncation could be applied to words which do not find an exact match, but it would rarely have any effect. Of the words in the searches in Table 2.1, the ones which do not occur in the index (after correcting spellings) are teddyboys, Parkin, Rees, traumatology, communalism. None of these searches is helped by treating the words as truncations (Parkin finds Parkinson, the others do not match anything). If the words are truncated, communalism will find communal and traumatology will find trauma (or would if it were in the file); but this is more efficiently done by automatic stemming.

AUTOMATIC STEMMING

It is obvious that some searches of keyword systems would work better if at least plurals, singulars and possessives were conflated. This is not always safe but there is no need for research before deciding whether it should be done. At least six of the words in Table 2.1 work more effectively if this rudimentary stemming is applied (drug abuse = abuse of drugs etc). We do not know of any commercial system which provides this, although there is one where it is in the specification but not yet implemented.

One of the objects of this research was to determine whether a stronger form of stemming should be applied. There are several examples in the sample searches where it might or might not be beneficial: abuse/-ing, dependants/-ents/-ence, modernism/-ist, nationalised/-isation/-ising.

A few of the experimental (non-commercial) online catalogues apply some degree of automatic stemming to the words of a search, and look them up in a stem index. Foremost amongst these is CITE, developed at the National Library of Medicine. CITE is briefly described below in 2.5. A system written by Peter Butcher (one of the inventors of the original version of the PRECIS subject indexing method) at City University removes terminal "s".
Bell and Jones' system MORPHS at the Malaysian Rubber Producers' Association [12, 13, 14, 15] is an in-house reference retrieval system rather than an online catalogue. MORPHS incorporates stemming which is semi-automatic; users have a degree of control over its application. Frakes [16] discusses a system called CATALOG which uses Porter's stemming procedure [17] with, Frakes reports, results as good as those obtained by intermediaries using manual truncation.

We have no evidence on the effectiveness of stemming in a public catalogue. Stemming techniques are discussed in Chapter 3.

Librarians usually think of cross-referencing as referring to means of pointing from "non-preferred" to "preferred" headings ("see" references) and from headings to related headings ("see also" references). We include under this heading a number of devices which are mentioned below and discussed at greater length in Chapter 4.

UK MARC allows cross references in the MARC [...] fields, although most libraries do not use them. The in-house online catalogue at Cambridge University is an exception. In North America "see from" headings do not usually occur in bibliographic records. They are often held in separate authority files which can be used in conjunction with the indexes to the bibliographic records.

Some of the phrase-access, browsing type catalogues allow the display of and selection from "see" and "see also" references. Access to the bibliographic file is via an authority file, each heading in which is linked either to bibliographic record(s) or to a "preferred" heading. We have no evidence about the extent to which such facilities are actually used.

A few catalogue systems (OCLC, URICA) allow libraries to construct their own word and/or phrase equivalence tables, which are used automatically. Obvious candidates are word = phrase equivalences like "USA" = "United States of America". The University of California's MELVYL system uses a table for the expansion of abbreviations [...]

[...] Oxford, 1979. London : Aslib, 1979, 155-81.

14 BELL C L M and JONES K P. The development of a highly interactive searching technique for MORPHS (Minicomputer Operated Retrieval (Partially Heuristic) System). In : Proceedings of the third joint BCS and ACM symposium, King's College, Cambridge, 2-6 July 1984. Edited by C J van Rijsbergen. Cambridge : Cambridge University Press on behalf of the British Computer Society, 1984.

17 PORTER M F. An algorithm for suffix stripping. Program 14 (3), 1980, 130-137.

18 UNIVERSITY OF CALIFORNIA. DIVISION OF LIBRARY AUTOMATION. MELVYL reference manual : University of California online catalog. Berkeley : the University, [...]

19 KELLY B and others. Bibliographic Access & Control System. Information Technology and Libraries 1 (2), June 1982, 125-132.

20 FENICHEL R R and BARNETT G O. An application-independent subsystem for free-text analysis. Computers and Biomedical Research 9, 1976, 159-167.

21 COCHRANE P A. "Friendly" catalog forgives user errors [...] online system called "Paperchase". American Libraries 13 (5), May 1982, 303-306.

22 DOSZKOCS T E. AID : an Associative Interactive Dictionary for online searching. Online Review 2 (2), 1978, 163-173.

23 DOSZKOCS T E and RAPP B A. Searching MEDLINE in English : a prototype user interface with natural language query, ranked output and relevance feedback. In : American Society for Information Science. Annual Meeting (42nd : October 1979 : Minneapolis). Information Choices and Policies. Proceedings 16. Edited by R D Tally and R R Deultgen. 1979, 131-139.

24 DOSZKOCS T E. From research to application : the CITE natural language information retrieval system.
In : Research and Development in Information Retrieval : Proceedings, Berlin 1982. Edited by Gerard Salton and Hans-Jochen Schneider. Berlin : Springer-Verlag, 1983, 251-262.

25 DOSZKOCS T E. CITE NLM : natural language searching in an online catalog. Information Technology and Libraries 2 (4), December 1983, 364-380.

26 ULMSCHNEIDER J E and DOSZKOCS T E. A practical stemming algorithm for online search assistance. Online Review 7 (4), August 1983, 301-315.

27 SIEGEL E R and others. A comparative evaluation of the technical performance and user acceptance of two prototype online catalog systems. Information Technology and Libraries 3 (1), March 1984, 35-46.

28 SIEGEL E R and others. Research strategy and methods used to conduct a comparative evaluation of two prototype online catalog systems. National Online Meeting, New York, April 12-14 1983. Proceedings. 1983.