1

INTRODUCTION

1.1

The Okapi projects

The work described in this report is the third major project in a continuing series concerned with research and development in the area of bibliographic retrieval systems for direct use by end users, and with subject access to online catalogues in particular. These projects have become known as the Okapi projects after the name of the original prototype system of 1984. Most of the work has been funded by the British Library Research and Development Department under a series of grants since 1982. Up to the present the work has been carried out at the Polytechnic of Central London (PCL), but it is expected that future projects will be done at City University, London and at the University of Bath. Published material on the Okapi projects prior to the present one includes [JONE86, JONE88, MITE85a, MITE85b, MITE85c, WALK86, WALK87a, WALK87b, WALK88, WALK89a].

The first Okapi project (1982-1985) comprised the design and development of a networked microcomputer-based online catalogue. This was originally intended to be part of an investigation of library uses of networked microcomputers. Because of the interests of the investigators, it became an attempt to develop a prototype end user retrieval system which aimed to combine instant usability with a relatively high degree of effectiveness. The resulting system, Okapi '84, was installed in one of the PCL's site libraries, giving access to a database of about 90,000 monograph records, and a limited amount of evaluation was done.

Okapi '84 offered users just two types of search, which it offered as "Specific books" and "Books about something". The specific item search used some fairly elaborate techniques "behind the screen", while appearing rather simple to the user. Users were encouraged to fill in whatever they knew of the title and/or author of the sought item, and the system would take courses of action which depended on the information provided and on the results of initial searches. The subject search, on the other hand, was extremely simple. It simply looked for records which contained as many as possible of the words of the user's freely expressed search. This project was reported by Mitev, Venner and Walker in [MITE85b].

Such evaluation data as was collected suggested that Okapi '84's specific item search was not particularly successful, whereas the very simple subject search was relatively effective. Subsequent work has been mainly concerned with subject access; most of the later Okapi systems have offered subject searching only.

1.2

Notes on terminology

The following notes introduce our use of a few technical terms.

An information retrieval (IR) system is a computerized system for the retrieval of bibliographic references. We only consider interactive IR systems - ones where users obtain results while they wait. Of these, we only discuss end user interactive IR systems. These are systems which are intended for and mainly used by the people who want the information (library patrons, researchers).

An online catalogue (online public access catalogue, OPAC) is an end user interactive IR system by means of which library users try to find references to material held in a library.

The references retrieved with IR systems are often referred to as documents (although this is precisely what they are not).

A query usually means the search statement which a user types into an IR system. Query modification is alteration of a query by the user or the system or both. Query expansion is a little narrower than query modification, suggesting the addition of terms to the query. There are many possible types and varieties of query expansion. Originally, we used automatic query expansion to mean expansion which is applied without any user intervention, for example the addition by the system of morphologically related terms to the user's query; semi-automatic query expansion applied to systems where terms were selected automatically but the user chose whether to invoke an expanded search. To conform with other people's usage, we now refer to the latter as automatic query expansion, and semi-automatic expansion applies to cases where the user can intervene by selecting or rejecting, and possibly ranking, terms which the system has suggested.

Relevance is a nebulous relation between a document, a query and a user. The terms pertinence or usefulness or utility may be used when there is an emphasis on the satisfaction of the current needs of a specific user. Relevance information is information about the relevance or non-relevance of documents to a query. In principle there can be any number of degrees of relevance - a document can be "probably relevant", "fairly relevant" or "almost certainly not relevant", but in the work described here we only used a binary division into something like "likely to be relevant" and "non-relevant".

Relevance feedback is the application of relevance information obtained in the early stages of a user's session on an IR system to modify the behaviour of the system during the remainder of the session. In particular, relevance feedback can be used in the implementation of query expansion. Automatic query expansion based on relevance feedback is the main topic of this report.

The precision of a search is the proportion of retrieved records which are relevant. The recall of a search is the proportion of the relevant records which are retrieved. Precision and recall have been widely used as performance measures for IR systems, particularly in laboratory tests of batch systems. It is not easy to apply them in experiments with interactive systems accessing databases of realistic size.
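The definitions of precision and recall given above can be illustrated with a minimal sketch. It assumes that the complete set of relevant records is known, which is exactly what is hard to establish for an interactive system on a realistic database; the function name and the record identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search.

    retrieved: record ids returned by the search
    relevant:  record ids judged relevant to the query (assumed fully known)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant records that were actually retrieved
    # Guard against division by zero for empty searches or empty judgments.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four records retrieved, two of them relevant; five relevant records exist.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)  # 0.5 0.4
```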

1.3

Subject access in online catalogues

We said above that the Okapi subject search was relatively good, but this must be qualified by emphasizing that it was only good relative to other end user retrieval systems. There is a wide measure of agreement that most systems are less effective, efficient and helpful than they ought to be.

The primary subject access functions in online catalogues fall into two categories, heading and keyword. Many systems also have a search by classification code or shelfmark, but there are numerous reports that a shelfmark search option attracts little use. Many systems offer more than one of these access methods.

In response to the user's search, systems providing access by headings usually show an alphabetically ordered display of headings around that which is "nearest" to the search. The user may then browse headings in alphabetical order or choose to see the bibliographic records which are indexed by a given heading. Fig 1.1 illustrates one of the more elaborate heading-based searches, the Library of Congress Subject Heading (LCSH) search in Ohio State University's online catalogue.

Fig 1.1

Subject heading display in the Ohio State University catalogue

COMMAND: sub/nutrition
RESPONSE:

    ITEMS  SUBJECTS
      764  Nutrition
        7  Nutrition--ABSTRACTS
        3  Nutrition--Abstracts--Periodicals
           Nutrition--Aging effect
             SEARCH UNDER: Aging--Nutritional aspects
        5  Nutrition and dental health
        2  Nutrition and dental health--United States
           Nutrition and state
             SEARCH UNDER: Nutrition policy

MORE: PS+    BACK: PS-
FOR TITLES, ENTER: TBL/number
FOR NOTES OR RELATED SUBJECTS (ONLY WHEN NUMBER IS AT RIGHT), ENTER: SAL/number

Systems providing subject access by keyword vary greatly with respect to the source of the keywords and the way in which users' input is matched against the indexes. A few of the older systems will process only a single keyword, but most of the more recent ones work by extracting the words from the user's search and performing an implicit AND operation. Fig 1.2 shows part of a subject search for "Care of the terminally ill" in the University of California's MELVYL catalogue. Here, the words "care", "terminally" and "ill" have been looked up in an index of words automatically extracted from LCSH. The records which the system has found probably contain the LCSH "Terminally ill--Home care". This is an example of a search where keyword access has found some relevant records but heading access would have failed. Other keyword systems often use title words in addition to words from LCSH, and even words from series, corporate authors and notes fields.


Fig 1.2

Result of a keyword subject search on the MELVYL catalogue of the University of California

Search request: FIND SUBJECT CARE OF THE TERMINALLY ILL
Search result: 11 records at all libraries
Type HELP for other display options.

 1. BENSON, Hazel B. The dying child : an annotated... 1988
 2. BROOKS, Charles H. Cost savings of hospice home care... 1983
 3. CHILDREN'S HOSPICE ADVISORY PANEL. Conference (1984 : Washington, D.C.) Children's Hospice Advisory Panel Conference report : December... 1984
 4. The Dying [illegible] 1979
 5. Home care for the dying child : professional and family perspectives. 1976
 6. LACK, Sylvia A. First American hospice : three... 1978
 7. LITTLE, Deborah Whiting. Home care for the dying :... 1985
 8. NATIONAL SYMPOSIUM ON COPING WITH CRISIS AND HANDICAP (1979 : Boston,...) Coping with crisis and handicap. 1981
 9. SPIEGEL, Allen D. Home health care. 1987
10. SPIEGEL, Allen D. Home healthcare : home birthing to... 1983
11. Terminal care at home. 1986

-> display 7 full
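The implicit AND matching described above can be sketched as follows. This is an illustration only, not the MELVYL implementation: the stopword list, the tokenizer and the three sample headings are assumptions (only "Terminally ill--Home care" is taken from the text).

```python
import re
from collections import defaultdict

# Hypothetical stopword list; real systems use their own.
STOPWORDS = {"of", "the", "and", "in", "for", "a", "an"}


def words(text):
    """Split text into lower-case alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each word to the set of record ids whose headings contain it."""
    index = defaultdict(set)
    for rec_id, heading in records.items():
        for word in words(heading):
            index[word].add(rec_id)
    return index


def implicit_and_search(index, query):
    """Return the records containing every non-stopword in the query."""
    terms = [w for w in words(query) if w not in STOPWORDS]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())  # one missing word empties the result
    return result


# Hypothetical subject headings keyed by record id.
records = {
    1: "Terminally ill--Home care",
    2: "Home care services",
    3: "Terminally ill--Psychology",
}
index = build_index(records)
print(implicit_and_search(index, "Care of the terminally ill"))  # {1}
print(implicit_and_search(index, "Care of the dying"))           # set()
```

The second search shows the weakness of the approach: a single word that is absent from the index causes the whole search to fail.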

There are many problems with both types of subject access. There is evidence that a majority of users find the more direct keyword approach preferable. There would also seem to be a greater chance of users' terms finding a match when the index language is not limited to the often stilted and out-of-date terminology of controlled subject headings. However, there seems to be little hard evidence that a keyword approach produces better results than access via headings. An important factor in the United Kingdom is that a large proportion of libraries do not have subject headings attached to their bibliographic records, so the choice is often between access by keywords from titles and access via classification. Most of the more recent systems seem to prefer keyword access, and all the Okapi systems have used keywords as the primary means of subject access.

Whichever access method - heading or keyword - is used, it is almost always found that a large proportion of searches retrieve no records at all. Wildly varying figures for the failure rate, ranging from 20% to 80%, have been quoted from various investigations under different experimental conditions, databases and search systems. Markey gives a selection of results in [MARK84]. Walker and Jones found that 34% of about 1000 searches by undergraduate users of a social sciences library would have found no records if submitted to a keyword system where terms are combined using an implicit AND operation [WALK87b, p121]. The University of California catalogue monthly statistics for March 1989 show that 30.4% of searches retrieved nothing. This figure includes all types of search.

A considerable proportion, perhaps a quarter, of these failures are due to spelling or typing mistakes. This should not be a serious problem, because the majority of such mistakes can easily be dealt with by designing the system so that it refuses to process the search until any unknown words have been negotiated with the user, but there are few systems which do this. We are left with a substantial proportion of subject searches which would still fail, typically between 25% and 40%.

The situation is really somewhat worse than this, because among "successful" subject searches there are many which are too general to be useful. It appears that many users quickly learn that searches are likely to fail unless they are broad, and this may partly account for the substantial proportion of searches like "accounting", "film", "statistics". Walker and Jones [WALK87b] found that a quarter of searches consisted of only one word. Many of these retrieve a great many records. On the MELVYL catalogue in March 1989 the mean number of records retrieved (all types of search) was 139, with an astonishingly high standard deviation of 1452. These figures include searches which retrieved no records. The University of California has an extremely large catalogue containing more than 5,000,000 titles.

1.4

Subject access in Okapi '86

The second Okapi project, completed in 1987, investigated several ways of increasing the recall of searches and reducing the proportion of search failures in end user systems. The methods used included computer-assisted spelling correction, automatic word stemming and automatic cross-referencing. The project and its subject search system Okapi '86 are described by Walker and Jones in [WALK87b]. Each of the above mentioned techniques was reasonably successful at increasing recall. Eighty-three percent of 600 searches collected from live use of the system found at least one record which the system judged to "match the search quite well" [WALK87b, p121].

At least as important as the recall-improvement devices is the fact that the Okapi systems use a method of search term combination which is weaker than the usual AND operation. All versions of Okapi, including the ones developed for the project described in the present report, combine terms on a "best match" basis. Terms are weighted in accordance with their relative frequency in the indexes, rare terms being given a higher weight than common terms. The weight given to a retrieved record is the sum of the weights of the terms common to the record and the query. The result of a search is a list of records ordered by weight, with the best matching records at the top of the list. Term weighting schemes are discussed in 2.2, and the Okapi term weighting and term combination procedure is described in Chapter 5 of [WALK87b].
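The best match combination described above can be sketched in a few lines. The weight function used here, log(1 + N/n) for a term indexing n of N records, is an assumption chosen only because it gives rare terms a higher weight than common ones; it is not the Okapi formula, which is described in Chapter 5 of [WALK87b]. The sample records and names are hypothetical.

```python
import math
from collections import defaultdict


def build_index(records):
    """Map each index term to the set of record ids it occurs in."""
    index = defaultdict(set)
    for rec_id, terms in records.items():
        for term in terms:
            index[term].add(rec_id)
    return index


def best_match(index, n_records, query_terms):
    """Rank records by the summed weights of the query terms they contain."""
    scores = defaultdict(float)
    for term in set(query_terms):
        postings = index.get(term)
        if not postings:
            continue  # an unmatched term does not cause the search to fail
        # Assumed weight: the rarer the term, the higher its weight.
        weight = math.log(1 + n_records / len(postings))
        for rec_id in postings:
            scores[rec_id] += weight
    # Highest weight first; ties broken by record id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical records, each reduced to a set of index terms.
records = {
    1: {"child", "development"},
    2: {"child", "psychology"},
    3: {"economic", "development"},
    4: {"child", "care"},
}
index = build_index(records)
ranking = best_match(index, len(records), ["child", "development", "mother"])
print([rec_id for rec_id, _ in ranking])  # [1, 3, 2, 4]
```

Under an implicit AND operation the same query would retrieve nothing, because "mother" does not occur in any record.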

1.5

Query expansion

The Okapi '86 system was reasonably effective at reducing the proportion of searches which fail completely, but there remain many searches which do not work as well as they should. It must be one of the primary aims of document retrieval system designers to produce systems which enable users to make searches which are as exhaustive or as selective as they wish. Ideally, every document on a topic should be retrievable without undue effort. In practice this is only the case for very small databases (but it is said that there are users of the British Library who go through the entire General Catalogue). All large bibliographic databases contain many items which are unlikely to be retrieved except by chance: items without subject headings, with metaphorical titles, misclassified.

As well as enabling exhaustive searches, retrieval systems should help users to refine or focus their searches. Many end user queries as initially submitted to the system are not a good representation of the subject the user is "really" looking for. Examples are "Intelligence" for the influence of heredity and culture on intelligence, or "Child development" for the effect of the mother on the development of the child. Both of these searches are likely to find some relevant records without too much effort, unless the database is very large, but there will be many records which they do not find. Searches like "Britain as a developing country" for the economic development of Britain in the 18th century are so inappropriate that they are unlikely to work. Such queries are not uncommon. Some of them result from lack of subject knowledge, some from misapprehensions about what the system is and what it does. Nevertheless, a proportion of 'dubious' queries will result in the user finding one or a few relevant records. This is more likely to be the case in a best match system like Okapi than in an implicit AND system.

Previous Okapi work has concentrated on the development of subject search systems which aim to maximize the likelihood of finding something relevant. Okapi '84 and '86 were relatively effective but, like most other end user retrieval systems, they were also dumb and unhelpful. Charles Hildreth remarked that they were also boring.

Three ways of improving end user retrieval systems were considered. The first of these applies mainly to online catalogues, which typically suffer from records which do not contain adequate subject descriptive material. It is the enhancement of subject description in the bibliographic records. It is likely that a catalogue with records enhanced by means of data from publishers' descriptive material will soon be set up and evaluated at the University of Bath.

The second area of research was concerned more with the user than with search functions or record content. This was to work towards a system which adapted its interaction according to its picture of the user's needs, aptitude and experience. A request for funding for a project in this area was not supported.

Finally, there is the subject of the work reported here. This is concerned with helping users who have already found some relevant material to obtain further, related items which have not necessarily been retrieved by the original query. There are a number of ways of tackling this. A simple and obvious one, which, surprisingly, has rarely been provided, is that of offering items classified in the same area as a relevant item. One of the few commercially available retrieval systems which offers this type of browsing in classified sequence is the BLCMP library system. Another way of extending or focusing a search is to branch to records with the same subject heading or other descriptors. At least one of the commercially available library systems (Dynix) provides this option, albeit rather clumsily. These systems and others are mentioned again in 2.4.1. We do not know of any evaluation of the effectiveness of such "pivot" techniques.

Perhaps more attractive than using a single pivot is the technique of using keywords selected from relevant records as new search terms, supplementing the original query. This is usually what query expansion means in this report. Because of the paucity of subject description in many catalogue records, it may be important to make use of as much as possible of the available information.
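Query expansion in this sense can be sketched as follows. The selection rule, which takes the terms occurring in the largest number of relevant records, is an assumption made for illustration and is not the procedure used in the Okapi systems; the function name, parameter names and sample records are hypothetical.

```python
from collections import Counter


def expand_query(query_terms, relevant_records, max_new_terms=3):
    """Supplement a query with terms drawn from records judged relevant.

    query_terms:      the user's original terms
    relevant_records: one collection of index terms per relevant record
    max_new_terms:    how many extracted terms to add (assumed cut-off)
    """
    query = list(dict.fromkeys(query_terms))  # drop duplicates, keep order
    counts = Counter()
    for terms in relevant_records:
        # Count each candidate once per record, skipping terms already used.
        counts.update(set(terms) - set(query))
    # Most widely shared terms first; alphabetical order breaks ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return query + [term for term, _ in ranked[:max_new_terms]]


# Hypothetical records chosen as relevant after a search for "intelligence".
relevant = [
    {"intelligence", "heredity", "environment"},
    {"intelligence", "heredity", "culture"},
    {"heredity", "culture", "testing"},
]
print(expand_query(["intelligence"], relevant, max_new_terms=2))
# ['intelligence', 'heredity', 'culture']
```

In a best match system the expanded query can simply be run again, since added terms raise the rank of records that share them without excluding records that do not.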

1.6

Feasibility studies

There are several existing systems which offer semi-automatic query expansion. Again, three of them are briefly described in 2.4. There are no known evaluation results, and none of the systems has quite the degree of instant usability which is regarded as essential in Okapi systems.

Some informal testing of query expansion techniques was carried out manually in 1985. Late in 1986 a modified version of Okapi '86 was made, which allowed a Dewey classification pivot search, accepted relevance judgments and extracted terms (including Dewey numbers) from relevant records. The user could then select from these terms and instruct the system to perform a new search. Extended tests were done on this system, using a technique in which real searches were repeated by an experimenter. Query expansion using terms automatically extracted from selected records appeared extremely promising, even with no term selection by the user. Records classified near selected records were sometimes useful, but often most or all of them were far removed from the original search.

The research proposal for the present project [WALK85] suggested the use of query expansion in which the original query terms were to be supplemented only by Dewey numbers extracted from relevant records. The preliminary experiments just described showed that this was unlikely to be an outstandingly useful technique, but that query expansion using subject and title words, selected by the system not the user, as well as Dewey numbers, was certainly worth trying with real users.
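A classification pivot search of the kind mentioned above can be pictured as browsing the shelf order around the class number of a relevant record. The sketch below is an assumption about how such browsing might be done, not a description of the modified Okapi '86; it treats Dewey numbers as strings with a three-digit main class, so that plain string ordering matches shelf order. The names and sample data are hypothetical.

```python
import bisect


def classified_neighbours(shelf, pivot_class, span=2):
    """Return the records classified at or near pivot_class.

    shelf:       list of (class_number, record_id) pairs sorted by class number
    pivot_class: class number of a record the user judged relevant
    span:        how many positions to show on each side of the pivot (assumed)
    """
    class_numbers = [class_number for class_number, _ in shelf]
    # Position of the pivot class in shelf order.
    pos = bisect.bisect_left(class_numbers, pivot_class)
    start = max(0, pos - span)
    end = min(len(shelf), pos + span + 1)
    return shelf[start:end]


# Hypothetical shelf list of (Dewey number, record id) pairs.
shelf = sorted([
    ("153.9", "A"),
    ("155.4", "B"),
    ("155.413", "C"),
    ("155.7", "D"),
    ("330.941", "E"),
    ("649.1", "F"),
])
print(classified_neighbours(shelf, "155.413", span=1))
# [('155.4', 'B'), ('155.413', 'C'), ('155.7', 'D')]
```

With a wider span the window soon reaches classes far removed from the pivot, which is consistent with the observation that records classified near a selected record were often unrelated to the original search.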

1.7

Development of the experimental system Okapi '88

Because query expansion using Dewey classification codes alone did not look very promising, most of the development work was put into producing a practical implementation of a system providing automatic query expansion based on terms extracted from relevant records. The retrieval systems - Okapi '88 - which were developed are described and illustrated in some detail in Chapter 3. The general appearance of the screens and also the structure of the programs and data are rather similar to previous Okapi systems, although the hardware (Sun) and operating system (Unix) environment are completely different. In particular, Okapi '88 incorporates the recall improvement devices developed for Okapi '86 [WALK87b], with the exception of spelling correction.

It was originally intended that the systems should be evaluated in live use in a library, but this was not practical. Some experiments were done using volunteer subjects under controlled conditions. These are described in Chapter 4.
