8 Evaluation

8.1 Objects of the evaluation

Before planning the evaluation we drew up the following list of questions to which we hoped to find answers; conclusions on each of them are given in Chapter 9.

Stemming and spelling standardisation (6.2)

1 Does it significantly increase recall? If so, for what types of search? In particular, how often do stemmed searches succeed where they would fail without stemming?
2 Does stemming significantly decrease precision or lead to false drops?
3 How does the use of both strong and weak stemming (EXP system) compare with weak stemming only (CTL system)? For example one might find that there are, on average, fewer rephrasings of searches on EXP than on CTL.
4 Does the EXP system's two-level merge (6.5) make any difference (except to decrease search speed)?
5 Is there a case for using strong stemming only? If so, should this apply to all searches, or only to those containing more than a certain number (two, say) of terms?

Spelling correction (6.4)

6 How effective is EXP's semi-automatic correction procedure? How does it compare with users' response to CTL's "CAN'T FIND" message? (Figs 7.5 and 7.6).

The GO/SEE list (6.3)

7 How often does it make any difference? Does our list contain appropriate entries? How should one compile such a list for a given environment?
8 Does the list lead to false drops ("us" [pronoun] = "United States")?
9 Should there be more than one type of object in the list (e.g. "see also"s as well as "see"s)?


Users' perception of and behaviour with the system

10 What sort of conceptual models do users have of the catalogue? (How do they think it works? Is it comfortable to use? Is it exciting or boring or silly?)
11 Does it give a dangerous impression of cleverness or of infallibility?

8.2 Methodology

We had to decide what experiments to carry out, and how many systems to compare. The eventual choice was influenced by the need to collect data on a considerable number (hundreds) of sessions and by the limited time available.

8.2.1 Evaluation considerations

The following points had to be considered.

1 Since much catalogue use is of a casual nature, we wanted to avoid motivating subjects (users) by putting them in an experimental situation. The emphasis was to be on natural, live use under unsupervised conditions. This ruled out the type of experiment where search topics are suggested to volunteer subjects.

2 There was very little perceptible difference between any of the available versions of Okapi '86 (EXP, CTL and a third system (OSTEM) which offers none of the retrieval aids). The dialogue, screen layouts, almost all the options and the graphic file were identical. Although a user might think "It doesn't always find the same books when I do the same search", or "Sometimes it suggests spelling corrections, sometimes it doesn't", there is no doubt that people would regard all three versions as being "the same catalogue".

3 Preliminary trials showed that most searches would retrieve substantially the same records (sometimes in a different sequence) on the EXP and the CTL systems. This meant that a large number of user sessions would have to be studied to determine whether there are significant differences between these two systems ("noise" is usually the most significant factor in the evaluation of IR systems). We estimated that several hundred sessions would be needed.

4 The data collection methods readily available were automatic transaction logging and post-search interviewing. It would have been difficult to implement facilities for providing printouts on which volunteer users could indicate relevance assessments. Transaction logs do not give a direct indication of the relevance of the items retrieved, nor do they reliably indicate session boundaries (i.e. the point at which one user gives way to another at the terminal).

8.2.2 Controlled or uncontrolled experiments?

We reluctantly decided not to use a "comparison search" experiment along the lines of Siegel [1] or Markey [2]. Their experiments consisted in randomly assigning volunteer users with genuine needs to system A or system B, then asking them to repeat their search on the other system. Subjects were given a printed listing of each search and asked to judge the relevance of each item retrieved on each system and to answer some general comparative questions. There is no doubt that this methodology works well when there is a considerable difference between the two systems. In our case, we had compared the two systems on a number of genuine searches taken from Okapi '84 transaction logs and we knew that for the majority of searches they would retrieve the same records. Although subjects in a comparison search experiment could be asked to indicate the relevance of each item retrieved on each system, general comparative questions about features of the two systems could not be asked.

We decided to observe (unobtrusively) as many sessions as possible at one terminal (space restrictions prevented observation at both terminals), and to use the log data from observed sessions and also from unobserved sessions during the time observation was being carried out. The main purpose of the observation would be to determine session boundaries, but very short interviews were held to try to get an answer to the question "Did you find what you were looking for?". We thought it just possible that there might be a significant difference between the proportion of satisfied customers at the EXP and the CTL systems. The log data was to be used for the repetition of searches by the experimenters. Searches submitted to one system would be repeated on another and the system output compared.

8.3 Data collection and collation

Data collection took place in the Polytechnic's Riding House Street site library. This library caters mainly for full-time and part-time students of Social Sciences, Business Studies and Communication. The user population is probably something over a thousand. The Polytechnic was at the time in the process of installing the new SWALCAP LIBERTAS library management system. When data collection started many users were already familiar with the LIBERTAS online catalogue, as well as with Okapi '84. There were also a few microfiche readers with fiche catalogues for the Polytechnic as a whole and for a number of other academic libraries.

Two Okapi terminals running the new systems were installed on 23 Oct 1986. They replaced terminals which had been running Okapi '84. The remaining two Okapi '84 terminals were removed. Each Okapi station was next to a LIBERTAS terminal and near to a fiche reader. Both stations were situated in areas of heavy catalogue use, one on the first floor and one on the second. The user populations are unlikely to be the same on the two floors because of the different subject areas covered by the book stocks. A suggestion book was attached to each terminal.

8.3.1 Acceptance tests

Before starting data collection the systems were run for six working days under continual informal observation. A few minor alterations were made to the programs during this time, mainly to the wording of some of the screen displays and to the matching procedure for EXP's spelling correction. We were interested to see whether users found the screen dialogue comprehensible, particularly the prompts for word replacement following a "CAN'T FIND" message (Figs 7.5, 7.6 and 7.8), and whether they read the introductory screen which emphasised that the catalogue was for subject searching only (Fig 7.1). The word replacement dialogue seemed very successful, but a significant minority of users tried to do specific item searches, mainly by title, but a few by author. (Since there was no author index, the latter were particularly unsuccessful.) We placed a large notice on fluorescent card above each terminal:

   OKAPI '86 is an experimental computer catalogue for subject searches.
   OKAPI '86 WILL ONLY LOOK FOR BOOKS ON A SUBJECT.
   Please use one of the other catalogues if you have to look up the title or the author of a book.
   If you have any suggestions or comments, please use the book.
   THANK YOU FOR YOUR HELP


The EXP and CTL systems were alternated between the floors daily (Monday to Friday) during the trial period, and during and after formal data collection. This should effectively randomise over daily and "floor" variation in user population.

8.3.2 Observation and interviewing

Observation and interviewing were carried out by one of the experimenters at the first floor terminal from 31 Oct to 12 Nov 1986. The experimenter sat at a staff desk within a few feet of the terminal. Although the experimenter was conspicuously present most catalogue users seemed to assume that he was a member of the library staff, and there was little evidence that people felt that they were being observed. The experimenter recorded start and finish times, and when the user got up to leave the terminal he or she was asked to "answer a few questions about your use of the catalogue". The experimenter introduced himself

   Hello! I'm from the library research team which designed this experimental computer catalogue. We are talking to library users to find out how useful this new catalogue is for you. I'd like to ask you a few questions about the search which you have just done - it won't take longer than two or three minutes.

and then conducted the following interview.

8.3.3 The interview

DATE: /Nov/1986    FLOOR:    TIME:

1. Have you used this particular computer catalogue before? NO.... YES.... Have you used it in the last two or three weeks? YES.... About how many times? NO
2. What were you looking for? (prompts: specific books, subject (books about something))
3. Did you find what you were looking for? NO.... Did the computer find any useful books? YES....


4. Did you have any particular problems in using the catalogue? YES.... Do you have any suggestions for making the catalogue better? NO.... Would you like to suggest any improvements all the same?

There was plenty of room on the interview sheet for the experimenter to record users' comments, descriptions of their search topic etc. Despite the notices on the terminals reminding users that the catalogue was only for subject searches, about a quarter of the interviewed users said they were looking for specific books, usually by title. When confronted with this, some users said that they had read the notice but that the catalogue worked for titles, so why not use it? (In fact, like all purely keyword systems, it did not work very well for titles which consist of common words, such as "Introduction to sociology" or "War and peace". Such titles can be found, but they are often swamped by large numbers of irrelevant items. Had we used a larger stop list it would have been even worse.) After elimination of specific item searches there remained 121 recorded sessions. The results of the interviews are given in 8.4.

8.3.4 Transaction log data

Since the observed searches did not give enough data for a satisfactory comparison of the EXP and CTL systems, we carried out extensive analysis of the transaction logs for all searches carried out at both terminals during the period of observation. The logs contain a rather complete record, almost down to the keystroke level, of user input to the system. They contain enough information to enable an experimenter to repeat a search exactly provided that the repetition is done using the same search program, source file and indexes. (For use by external researchers logs should contain complete system output including the text of record displays, but we do not do this because it makes the files unmanageably large.) Appendix 3 is an annotated extract from a log file. For statistical analysis, information from the log files was condensed into a file called SRCHES (8.3.6).

8.3.5 Searches and sessions

There are frequent references in this report to sessions and searches. Some discussion of our use of these terms may be helpful. We defined a search to be a period of use of the system which begins with a search statement submitted to the system and ends with a return to either the search input screen (Fig 7.2) or to the "home" screen (Fig 7.1). The natural definition of a session is the time during which one user (or group of users) is carrying out a search or sequence of searches at a terminal. It begins when a user keys something at a terminal and finishes when the same user(s) leave the terminal.

It is not always easy to determine natural session boundaries even by observation. For example if a user gets up, then comes back to the terminal and continues within a few minutes this might properly be regarded as one session; conversely, if the user carries out two distinct searches or sequences of searches whose topics are clearly unrelated it might be more accurate to regard this as two sessions rather than one. Since much of the user activity we analysed was not observed, for some purposes we regarded a session as being a sequence of one or more related searches ending either at the end of a "natural" session or when the same user started a search which was unrelated to the previous one. It is not always easy, or even possible, to decide whether or not a search is related to its predecessor. Where an experimenter was unable confidently to make a decision a consensus was sought. In doubtful cases sessions were discarded.

8.3.6 The SRCHES file

We determined session boundaries using interview sheets where available or, in the absence of observation, the time during which terminals were unused and the relationship between successive searches. It is a reasonable assumption that if a terminal is unused for three minutes a session has ended. Conversely, even if the next search is quite unrelated, it is likely to be the same user unless at least ten seconds has elapsed since the last keystroke.

SESSION AND SEARCH BOUNDARIES

Printouts of the transaction logs were marked up with session boundaries by the three researchers. Some were cross-checked for consistency by a second person. Each session was given a reference number and the searches within each session were numbered consecutively.
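The two timing rules of thumb given above can be sketched in code. This is only an illustration, not the procedure actually used (the logs were marked up by hand): the function names and labels are hypothetical, and idle gaps between ten seconds and three minutes are left to human judgement about how the searches relate.

```python
from datetime import datetime, timedelta

# Thresholds quoted in the text; all names below are illustrative assumptions.
SAME_USER_MAX_GAP = timedelta(seconds=10)
SESSION_END_MIN_GAP = timedelta(minutes=3)


def classify_gap(last_keystroke, next_keystroke):
    """Classify the idle time between two logged keystrokes."""
    gap = next_keystroke - last_keystroke
    if gap < SAME_USER_MAX_GAP:
        return "same user"             # under ten seconds idle
    if gap >= SESSION_END_MIN_GAP:
        return "session ended"         # terminal unused for three minutes
    return "judge by relatedness"      # needs a look at the searches themselves


def provisional_sessions(keystroke_times):
    """Group keystroke times, splitting only where a session clearly ended."""
    sessions = []
    for t in sorted(keystroke_times):
        if sessions and classify_gap(sessions[-1][-1], t) != "session ended":
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions
```

The groups produced this way are only provisional: a group may still need to be split where an unrelated search follows a gap of more than ten seconds.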


CLASSIFICATION OF SEARCHES

A majority of searches were what most people would regard as reasonable descriptions of a subject ("History of the theory of probability", "Social stratification"). These were classified as type S (Subject). A surprisingly large proportion, classified as Q, took the form of essay titles ("By what means are we educated for sexual inequality in work"). Some searches were evidently or probably for specific items rather than for "books about something" ("The anatomy of accounting", "Cuban agricultural [sic] & development: contradictions & progress"). These were classified as T (title).(1)

(1) It is important and not always easy to distinguish between specific (title) searches and subject searches. Many subject search statements look like titles, and (without asking the user) it is only possible to make this distinction after looking at other searches in the session, and at the relationship between the apparent relevance of the books retrieved and the time which the user spent looking at record displays. Display of a book with the exact title followed by end of session is good evidence that this was a title search. When in doubt, we classified the search as subject.

There were also the usual searches consisting of obscenities or fooling around, and a few which we had to classify as "rubbish". (Interestingly, there were no insults aimed at the catalogue.)

We also made an attempt to classify the language of searches with regard to its "appropriateness". This is rather subjective and we have made no use of this classification in the evaluation apart from code M which was used to denote searches spoilt by uncorrected mistakes ("The affect of working women on consumer behaviour").

RELATIONSHIP BETWEEN SUCCESSIVE SEARCHES IN A SESSION

Within a session, each search was classified as a repeat of the previous one, related to the previous one (in the same session), unrelated or indeterminate. When a search is related to the previous one we recorded the type of relationship as broader, synonymous, narrower or other relationship.

NUMBER OF TERMS IN A SEARCH

It is to be expected that the effects of stemming will be more marked when there are more than two or three terms in a search statement, so we recorded the number of terms in each search. The phrases recognised by the EXP system were counted as a single term (see next paragraph), so the same search statement could have a different term count depending on which system it was submitted to. Stop words were not counted.

The number of terms was defined to be the number of items displayed on the "searching" screen (Fig 7.4) after any substitutions or deletions of words which were not found. Thus "film editting in great britain" contains three terms on the EXP system if "editting" is found or corrected (or two terms if it is ignored), but it contains four (or three) terms on the CTL system because "great britain" is two words. (A search in which none of the words is found, and the user instructs the system to ignore them, contains no terms. There were a handful of these empty searches, and they were excluded from the statistical evaluation.)
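The counting rule can be illustrated with a minimal sketch. It is not the Okapi code: the function name, the one-word stop list and the phrase list passed in are assumptions chosen only to reproduce the "film editting in great britain" example.

```python
def search_terms(statement, phrases=(), stop_words=("in",), ignored=()):
    """Return the terms of a search statement as they would be counted.

    phrases    -- multi-word phrases counted as a single term (EXP only)
    stop_words -- words that are never counted (assumed list)
    ignored    -- words not found which the user told the system to ignore
    """
    words = statement.lower().split()
    # Longest phrases first, so that a longer phrase wins over a shorter one.
    phrase_list = sorted(
        (tuple(p.lower().split()) for p in phrases if p.split()),
        key=len,
        reverse=True,
    )
    terms = []
    i = 0
    while i < len(words):
        for phrase in phrase_list:
            if tuple(words[i:i + len(phrase)]) == phrase:
                terms.append(" ".join(phrase))
                i += len(phrase)
                break
        else:  # no phrase starts at this position
            word = words[i]
            i += 1
            if word not in stop_words and word not in ignored:
                terms.append(word)
    return terms


query = "film editting in great britain"
exp_terms = search_terms(query, phrases=["great britain"])  # EXP: phrase is one term
ctl_terms = search_terms(query)                             # CTL: no phrase recognition
print(len(exp_terms), len(ctl_terms))
```

Passing `ignored=("editting",)` gives the lower counts mentioned in the text, two terms for EXP and three for CTL.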
8.3.7 Description of the SRCHES file

Each record in the file contained the following fields:

1 session number
2 search number
3 date and time
4 system (E = EXP or C = CTL)
5 whether observed (O or N)
6 number of terms (defined above)
7 search type (defined above)
8 appropriateness of terminology (discussed above)
9 search result
     N = books found
     0 = no hits
     R = user aborted with red key
     X = user aborted with black (end session) key
10 number of postings with maximum weight (NMPW) (i.e. the number of hits on an implied AND)
11 number of postings with "good" weight (NGW)
12 total number of postings (NRW)
13 user action following display of search result
     G = green key - look at records
     B = blue key - return to input screen to alter current search
     R = red key - return to clear input screen
     X = black key - end session
     T = system left to time out
14 number of records displayed by the user
15 time spent looking at records
16 relationship to previous search
     F = first in physical session
     R = related
     I = identical
     E = equivalent
     U = unrelated
     O = other or indeterminate
17 type of relationship (if related)
     B = broader
     N = narrower
     S = synonymous
     O = other (sideways relationship)

After "rubbish", "fooling" and "empty" searches had been excluded the SRCHES file contained records for a total of 1087 searches. This was partitioned into the sets EXPALL (603 searches on EXP) and CTLALL (484 searches on CTL). The F (first) and U (unrelated to previous) searches were extracted from CTLALL to form a set of 255 initial searches. Thus CTLALL was regarded as containing 255 sessions comprising 484 searches.
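The partitioning just described can be sketched as follows. The record layout is a cut-down, hypothetical rendering of the SRCHES fields (only fields 1, 2, 4 and 16 are kept), the class and function names are assumptions, and the four sample records are invented purely to exercise the code.

```python
from dataclasses import dataclass


@dataclass
class Search:
    session: int       # field 1: session number
    number: int        # field 2: search number within the session
    system: str        # field 4: "E" = EXP, "C" = CTL
    relationship: str  # field 16: F, R, I, E, U or O


def partition_by_system(searches):
    """Split the file into the EXP and CTL sets (EXPALL and CTLALL)."""
    exp_all = [s for s in searches if s.system == "E"]
    ctl_all = [s for s in searches if s.system == "C"]
    return exp_all, ctl_all


def initial_searches(searches):
    """Keep F (first) and U (unrelated to previous) searches only."""
    return [s for s in searches if s.relationship in ("F", "U")]


# Invented sample records, not taken from the real file.
sample = [
    Search(1, 1, "C", "F"),
    Search(1, 2, "C", "R"),
    Search(1, 3, "C", "U"),
    Search(2, 1, "E", "F"),
]
exp_all, ctl_all = partition_by_system(sample)
print(len(ctl_all), len(initial_searches(ctl_all)))
```

Counting each F or U search as the start of a session is what lets the number of initial searches stand in for the number of sessions.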

8.4 Analysis of observation and interview data

8.4.1 Success rate reported by users

Answers to the question "Did you find what you were looking for?" were recorded as "yes", "probably, but need to check the shelves to make sure" and "no". There were no other responses. In the "no" case, users were asked the supplementary question as to whether the computer had "found anything useful". Some of the sessions contained more than one search topic or group of topics, and two subjects answered both "yes" and "no" - meaning that their session had included both successful and unsuccessful searches. Table 8.1 summarises the results.

Table 8.1 Success rate for observed sessions by system

                                        CTL system    EXP system    Total
Successful                              49            42            91
Probably successful                     6             8             14
Subtotal                                55 (87.3%)    50 (84.7%)    105 (86.1%)

Unsuccessful, but useful books found    2             3             5
Unsuccessful                            6             6             12
Subtotal                                8 (12.7%)     9 (15.3%)     17 (13.9%)

Total                                   63            59            122

There is no significant difference between the session success rates on the two systems. The failure rate is too low for it to be worth tabulating previous online catalogue experience against success/failure.

8.4.2 Brief analysis of the 17 "failure" sessions

A transcript of the detailed report is given as Appendix 4. All but two of the 17 sessions contained more than one search. The searches given in the following analysis have been chosen as being representative. One session is omitted because it appears to consist of searches for specific titles.

1 Not in the catalogue (two sessions)

"HMSO employment statistics". This appears to work quite well but the user wanted 1986. This might be counted as a specific item search.

"Acquired immune deficiency syndrome". This finds two false drops offered as "2 books found, but they don't match your search very well". User had tried "aids", and looked at the first 12 of 302 books found.

2 User's language doesn't match index language (seven sessions)

These searches are reasonably comprehensible to a human, but not to the catalogue.

"Generic social work"
"A definition of social work"
"Employment structure"
"Passing of laws"
"Recent changes in Londons economy". This was the only search of the session. None of the 14 records was "good" - they all contained "recent" and one of the other words. "London's economy" finds eight books, one of which appears very good.
"Truancy". Unfortunately this does not stem to "truant", which gives one good record. User tried "School absenteeism" and "Absenteeism".
"Sociology of shopping". This user then tried "Shopping", looking at 50 of the 149 books, followed by "Anthropology of shopping". The indexic(1) search "Consumer behaviour" finds probably-relevant books.

(1) Indexic = type of language used by classifiers and indexers.

3 Search too specific (four sessions)

Indexing contents pages might help these. The library almost certainly has relevant material, and they are clearly expressed.

"Textile industry input-output tables"
"Feuerbach"
"The advantage of india to britain in colonial rule"
"Employment trends post war"


4 Search needs elucidation

"Sterling". The interviewer transcribed the user's description of his subject as "Economics - sterling shares and gold".
"Britain as a developing country". This search was explained as "Economic development of Britain in the 18th century".

5 Too many records

"The police". User looked at 23 of the 200 records and bemoaned the fact that most of them were in another branch of the library.

8.4.3 Comments made by interviewed users

Most of the following comments were made by users of Okapi '86 in response to the question "Do you have any suggestions for improving the catalogue?" at the end of their interviews. A few were made when they were asked whether they had found what they were looking for.

Most users did not or could not offer any suggestion. Some said, quite positively, that they couldn't think of any improvement. There were about 30 remarks like "Very good." "Easy to use." "Seems quite easy to use." "Very easy to use." "Simple." "No problems." "Says what to do - easy to follow." "Quite straightforward." "Excellent." "No it's easy. Absolutely not." "Fairly easy - you've got the coloured keys - you just press a button and there it is." and "Straightforward. Better than the one we had before [Okapi '84]."

"Like the way it gives all the information on one screen."
"You can search on what you want with this if you just type in some buzz-words."
".. just typed in the category I wanted and it came up with them. Lovely!"

Surprisingly, there were only two complaints about difficulty with typing, one from a first time computer user. There were a few complaints about not being able to do author/title searches, including "Do you remember the old one? It was really brilliant. You could put everything in. I don't really like this one."


Fourteen users did not feel able to assess relevance from the information given in record displays. A typical comment was "[Not enough information] - I suppose I'll have to look on the shelves." This was felt to be more serious when the books retrieved were in other branches of the library.

The remaining comments and suggestions are listed individually below.

"Wouldn't accept the category I put in."
Found "too many books" (on industrial relations).
Too many books on sociology, but none on sociology of shopping.
"There was a huge list I had to go through."
"A bit slow - goes through book by book."
"The time it takes could be improved upon."
"The other system [LIBERTAS] is more up to date."
"Thought you could only enter one word, so had to plough through 300 books to find what I wanted."
"A bit hard to communicate with it."
Liked the use of individual keywords for subject searching: ".. gives you a broader approach".
Search for third world development was "hopeless - got 400+ books, mostly not relevant - had to try 'Africa' instead."
"Sometimes the books I want come randomly rather than at the start."
"It only looks for keywords - doesn't analyse the search."
"It should recognise phrases - not do words separately."
Not intelligent enough: "A definition of social work" gave "rubbish".
Wanted to know when the "less well matching" books started.
Should include journals.
Should include chapters and indexes.
"Doesn't include works not owned by PCL." [There are microfiche catalogues for a number of other libraries.]

There were no comments on EXP's semi-automatic spelling correction, although this was sometimes strikingly successful and occasionally ludicrously wrong. In a search for "Diadic interractions" both words were properly corrected by the system, with the corrections accepted by the user. A search for "Bob Geldof" gave 'CAN'T FIND "bob"' (accepted by the user), followed by 'CAN'T FIND "geldof" - nearest match found is "gledyf"' [a Welsh word]; the user aborted the search and tried "ethiopa [sic] and band aid" which failed despite proper correction of the first word.

8.5 Statistical analysis of SRCHES file

8.5.1 Distribution of number of records retrieved by system

The EXP system must retrieve at least as many records as CTL for nearly all searches. (The reasons for the occasional reversal of this rule are given in a footnote under Table 8.6.) In particular, we expected that there would be fewer "zero hits" searches on EXP than on CTL. After "rubbish" and "fooling" searches had been excluded the following results were obtained.

Table 8.2 Proportion of "zero hits" searches by system

Searches retrieving..                                   CTL system     EXP system     Total
..no records at all                                     37 (7.6%)      31 (5.1%)      68 (6.3%)
..at least one record                                   447 (92.4%)    571 (94.9%)    1018 (93.7%)
..at least one record with minimum good weight          367 (75.8%)    498 (82.6%)    865 (79.6%)
..at least one record with max. possible weight (1)     317 (65.5%)    437 (72.6%)    754 (69.4%)

Total number of searches                                484 (44.6%)    602 (55.4%)    1086

(1) This row gives the proportion of searches which would have found at least one record if the search terms were combined using a boolean AND.


It appears that the EXP system is more likely to retrieve something than is the CTL system. This difference is not very marked; it is significant at the 10% level on a chi-squared test. However, EXP is almost certainly more likely to retrieve at least one record of "good" weight. The differences for minimum good weight and maximum possible weight are both significant at the 2% level.

The table does not tell us whether EXP's "extra" records are really any good. All or most of them may be false drops. Differences may be due to EXP's automatic inclusion of strong stems when necessary, to the use of the go/see list and to the system-suggested replacements for terms which are not found. This is discussed in 8.6.4.

The number of terms in a search has a bearing on the hit-rate. In retrieval systems which use an implicit AND, a majority of searches with three or more terms retrieve nothing: see, for example, [3, p208]. Nearly one third of all our searches would have retrieved no records on an "all or nothing" system (Table 8.2 above). In our systems, a record containing about half the terms of the search will be retrieved and a record containing about two-thirds of the terms may be offered as "matching your search quite well". Clearly, such "best match" type systems will also tend to retrieve fewer records as the number of terms in a search increases (unless there is no "cut-off" or minimum acceptable weight, in which case the system retrieves all records containing at least one of the terms).
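The significance levels quoted for Table 8.2 can be checked with a small sketch. It assumes the ordinary Pearson chi-squared statistic for a 2 x 2 table with one degree of freedom and no continuity correction; the report does not say which form of the test was used, so this is an assumption, and the function name is illustrative. The critical values are the standard tabulated ones.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared (1 d.f., no continuity correction) for the table
           a  b
           c  d
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Upper critical values of chi-squared with 1 degree of freedom.
CRITICAL_1DF = {0.10: 2.706, 0.05: 3.841, 0.02: 5.412, 0.01: 6.635}

# Counts from Table 8.2: searches meeting each criterion, out of each total.
CTL_TOTAL, EXP_TOTAL = 484, 602
rows = {
    "at least one record": (447, 571),
    "minimum good weight": (367, 498),
    "maximum possible weight": (317, 437),
}

for label, (ctl_hits, exp_hits) in rows.items():
    stat = chi_squared_2x2(
        ctl_hits, CTL_TOTAL - ctl_hits,
        exp_hits, EXP_TOTAL - exp_hits,
    )
    levels = [p for p, crit in CRITICAL_1DF.items() if stat > crit]
    print(f"{label}: chi-squared = {stat:.2f}, significant at {levels}")
```

Under these assumptions the "at least one record" row clears the 10% critical value but not the 5% one, while the two weight rows both clear the 2% value, which agrees with the levels stated in the text.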

Table 8.3 Distribution of number of terms in searches

Number of terms    Number of searches    Cumulative %
1                  261 (24.0%)           24.0
2                  445 (41.0%)           65.0
3                  217 (20.0%)           85.0
4 and more         163 (15.0%)           100.0

Total              1086 (100.0%)

The statistical analysis of Table 8.2 was repeated with searches broken down by the number of terms they contain. The effect of the go/see list was minimised by counting a go/see phrase as a single term. For example "Underdeveloped countries" counts as two terms when submitted to CTL but it is one term in EXP. This is shown in Table 8.4.


Table 8.4 shows that EXP is markedly better than CTL at retrieving records of good weight for searches of three or more terms.

Table 8.4

Proportion of "good weight" searches by system by number of terms in search

Number of terms:               1                      2                  3 or more
System:                  CTL        EXP         CTL        EXP         CTL        EXP

No records of              6         12          44         43          67         50
good weight            (5.2%)     (8.2%)     (23.2%)    (16.8%)     (37.4%)    (24.9%)

At least one record      109        134         146        213         112        151
of good weight        (94.8%)    (91.8%)     (76.8%)    (83.2%)     (62.6%)    (75.1%)

Column totals            115        146         190        256         179        201
                      (44.1%)    (55.9%)     (42.6%)    (57.4%)     (47.1%)    (52.9%)

Sample size: 1087 searches
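As a rough check on this comparison, the chi-squared statistic for each pair of columns can be computed directly from the cell counts. The sketch below assumes the counts of Table 8.4 and uses the uncorrected Pearson statistic for a 2 x 2 table with one degree of freedom; the function name is illustrative, and the report does not say whether a continuity correction was applied in its own tests.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared (no continuity correction) for the table
        [[a, b],
         [c, d]]
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# (no good-weight record, at least one) for CTL, then the same for EXP
table_8_4 = {
    "1 term": (6, 109, 12, 134),
    "2 terms": (44, 146, 43, 213),
    "3 or more terms": (67, 112, 50, 151),
}

CRITICAL_5_PERCENT = 3.841  # chi-squared, 1 degree of freedom

for label, cells in table_8_4.items():
    stat = chi_squared_2x2(*cells)
    verdict = "significant" if stat > CRITICAL_5_PERCENT else "not significant"
    print(f"{label}: chi-squared = {stat:.2f} ({verdict} at the 5% level)")
```

With these counts only the comparison for three or more terms exceeds the 5% critical value.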


8.6 Repetition of searches by experimenter
8.6.1 Notes on method

The proper unit for evaluating success at the catalogue is a session, not a search. It is obvious that the way a user chooses to formulate a search is, in general, influenced by previous searches. Hence repetition of users' search statements on a different system may not be a good reflection of what the user would actually have done in a real session. On the other hand it is probably no more unrealistic than getting users to search one system followed by asking them to do the same search on the other system. Thus it is not realistic to repeat whole sessions, or searches which are clearly broadenings or narrowings of previous searches by the same user, on a system other than the one on which they were originally done. For reliable results, the only searches which should be used for repetition are those which are either the first search in a session, or are clearly unrelated to their predecessors. These are the searches we classified as F (first) and U (unrelated) in the SRCHES file (8.3.5). F and U searches can fairly realistically be regarded as initial searches in a session; that is, as being representative of users' initial statements of their needs.
8.6.2 Measures of success

Measures commonly used include precision and recall. Much comparative evaluation of reference retrieval systems has been done using standard queries submitted to small collections of documents. The relevance of each document to each query has been decided in advance, often by a number of subject specialists. This imparts a fine objectivity, but takes no account of the real behaviour of real users. The opposite end of the spectrum is represented by experiments where the sole (or main) criterion for the success or otherwise of a session by a user at a terminal is the user's degree of satisfaction. If the answer to the question "Did you find what you were looking for?" is "Yes" then the session was a success. In their "Dewey Decimal Classification Online Project" report Markey and Demeyer [2, Appendix I] use the concept "amount of useful information" as a measure. Records which bear no resemblance to the search were sometimes judged by the searcher to be useful, and these were counted as relevant by the experimenters. Repetition by an experimenter constitutes something between the two extremes. The experimenter must judge relevance as objectively as possible. The experimenter needs experience of reference work in libraries, and is more likely to make realistic judgments if he or she has some knowledge of the user population and their needs.
8.6.3 Experimenters' relevance judgments

Most of our repetition searches were carried out and assessed by Richard Jones, who is an experienced reference librarian. He tried to assess the relevance of retrieved records as a librarian with knowledge of local users and their needs, given only the users' searches as submitted to and logged by the catalogue. This was usually rather easy, provided the experimenter is aware that a proportion of searches are probably for titles. For example, SEVEN DEADLY SINS may have been a search for material about one of the films with this title. Sometimes the context helps: the initial search PROGRAM IN SOCIETY is only understandable given the knowledge that it was followed by PROGRAMMED SOCIETY, POST INDUSTRIAL SOCIETY and COMPUTERIZATION OF SOCIETY. Searches for which it is impossible to judge relevance are rare. INTERFACE is an example: there is an organisation called "Interface"; there may be a book with this title, but if so it is not in the catalogue. In a case like this any book with "interface/s/ing" in its title or subject headings would be counted as relevant.

It may be argued that consensus judgments made by a panel of assessors would be more reliable, but this is not really the point. We were comparing systems, not making an absolute assessment. It is reasonable to assume that the "experimenter effect" will apply more or less equally to each of the systems. However, we would accept a criticism to the effect that the repetition experiments should have been done by someone who did not know which records were retrieved by which system. We did not have enough time to set up such an experiment. The data will still be available for a more rigorous future experiment.
8.6.4 Searches which retrieved no 'good' records: EXP vs. CTL

Since EXP appears more likely to retrieve at least one record of good weight we repeated zero-NGW searches from CTLALL on EXP. In order to eliminate the effect of the spelling correction a few searches were excluded, either because they invoked spelling correction when submitted to EXP, or because the original user had aborted the search following a "CAN'T FIND" message. The results (Table 8.5) were surprising. Only four of the searches retrieved any records of good weight on EXP, although another 14 searches did retrieve more records of "acceptable" weight than they did on CTL.


This suggests that the spelling correction may be a more important factor than the strong stemming in the difference between the two systems.

Table 8.5

Searches which found no records of 'good' weight on CTL repeated on EXP

Search results                              Number of searches

Same                                          85
Extra records of good weight                   4
  - relevant                                   2
  - mixed                                      1
  - false drops                                1
Extra records, but below good weight          14
  - relevant                                   8
  - mixed                                      3
  - false drops                                3

Total                                        103

8.6.5 Comparison of recall on first search of session

All searches classified as F (first in a session) or U (unrelated to previous search) were selected from the set CTLALL. There were 255 such searches. These searches were all repeated on OSTEM and on EXP. A hundred of these searches each retrieved more than 20 records of "good" weight on OSTEM. These were discarded. It can be assumed that these 100 searches work satisfactorily - or retrieve too many records - on all three systems. All the records retrieved on each system by the remaining 155 searches were assessed for relevance.


Table 8.6

Repetition of initial searches

Search results                          CTL system (A)         EXP system (A)
                                        compared with          compared with
                                        OSTEM (B)              CTL (B)

Same records retrieved in A and B        71 (27.8%)             99 (38.8%)
More records in A than B
  - mostly relevant                      64 (25.1%)             38 (14.9%)
  - mixed                                 6  (2.4%)              6  (2.4%)
  - mostly false drops                    4  (1.6%)              9  (3.5%)
Fewer records in A than B (1)            10  (3.9%)              3  (1.2%)
Retrieved records not examined          100 (39.2%)            100 (39.2%)
(more than 20 recs. on OSTEM)

Total                                   255                    255

(1) Where there are fewer records this is usually due to the higher number of postings for a stem reducing the weight of one of the terms. Occasionally it is due to the higher weight attached to a go/see phrase in EXP. It does not necessarily indicate a worse result - rather the contrary.

More than a quarter of the searches do better in CTL than in OSTEM, and very few do worse. The difference between EXP and CTL (15%) is much less marked, and EXP retrieved a higher, though not significantly higher, proportion of false drops.

COMPARISON BETWEEN CTL AND OSTEM

Of the 74 searches which found more records on CTL than on OSTEM, 33 retrieved some additional records of maximum possible weight - that is, the records would have been retrieved on a system which uses weak stemming but combines terms using a boolean AND. An example is the search ABORTION ACTS. In OSTEM this finds 158 books indexed under "abortion" or "acts", but none under both. There are three books in the catalogue entitled "The working of the Abortion Act" (with subject headings "Great Britain - abortion - history"). These will eventually appear in the set retrieved by OSTEM if the user persists. CTL (or EXP) reports "3 books match your search exactly" and shows them first, followed by the other 16 books under "abortion".
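The ABORTION ACTS example turns on weak stemming combined with best-match ranking. A minimal sketch follows, under loud assumptions: the four-record catalogue is invented, the stemmer merely strips a final "s", and the weight formula (log of collection size over postings, plus one) is a common inverse-frequency scheme chosen for illustration, not the weighting actually used by the catalogue.

```python
import math

# Invented toy catalogue: record id -> title words (not the real data)
CATALOGUE = {
    1: "the working of the abortion act",
    2: "abortion law reform",
    3: "acts of parliament",
    4: "social history of britain",
}


def weak_stem(word):
    """Crude stand-in for weak stemming: strip a plural 's'."""
    return word[:-1] if word.endswith("s") and not word.endswith("ss") else word


def build_index(catalogue):
    """Map each stemmed title word to the set of records containing it."""
    index = {}
    for rec_id, title in catalogue.items():
        for word in title.split():
            index.setdefault(weak_stem(word), set()).add(rec_id)
    return index


def best_match(search, catalogue):
    """Rank records by the summed weights of the search stems they contain."""
    index = build_index(catalogue)
    scores = {}
    for stem in {weak_stem(w) for w in search.lower().split()}:
        postings = index.get(stem, set())
        if not postings:
            continue
        # More postings for a stem means a lower weight for that term
        weight = math.log(len(catalogue) / len(postings)) + 1.0
        for rec_id in postings:
            scores[rec_id] = scores.get(rec_id, 0.0) + weight
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = best_match("ABORTION ACTS", CATALOGUE)
print(ranking)
```

Record 1 contains "act", not "acts", so without the stemming step it would match only one of the two search terms and would not be ranked first.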


A further 11 searches retrieved more records of at least "good" weight, but less than maximum possible weight (they would have been offered to the user as "matching your search quite well"). The remaining 30 searches only gained records of "acceptable" weight ("N books found but none match your search very well").

COMPARISON BETWEEN ALL THREE SYSTEMS

Of the 53 searches which retrieved more records in EXP than in CTL, 17 were searches which had the same result on CTL as on OSTEM. They are therefore indicative of the differences (strong stemming and the go/see list) between EXP and CTL. These searches are listed in Table 8.7.

Table 8.7

Initial searches which were the same in CTL as in OSTEM but retrieved more records in EXP

Search                                         Category         Reason for greater
                                               (Good, Bad,      number of records
                                               Mixed)           (T: lookup table,
                                                                S: strong stemming)

BBC committees                                 G                T (BBC)
clientelism                                    B                S
American broadcasting                          G                T (America)
American radio                                 G                T
Japanese economy                               G                T (Japan)
lesbianism [twice]                             G                S
American power and the new mandarins           M                T
the Cuban crisis                               G                T (Cuba)
Korea                                          G                T (Korea)
immigration and race in British politics       G                S, T (immigrants)
ideology and cultural production               G                S (culture)
external broadcasting                          B                S
the new theatre and cinema of Soviet Russia    G                T
inter war Britain                              G                T (Britain)
BBC handbook                                   G                T
industrial concentration                       B                S

The searches listed in Table 8.7 suggest that the go/see list, small as it is, has a significant effect. Linking names of countries to adjectives of nationality appears to be particularly helpful. In all the searches which behaved differently between EXP and CTL, the go/see list affected 23 and the strong stemming 37 (several searches were affected by both). The go/see list was never detrimental, but the strong stemming led to some false drops in nine of the searches. (See 8.8 for further discussion of the effect of the go/see list.)

8.7 Treatment of users' words which are not in the index

In EXPALL and CTLALL combined there are 1087 searches. These contain 124 instances of words where neither the weak nor the strong stem is in the index. (This does not mean that 11% of searches contained a "CAN'T FIND", because a number of searches contained several of them. After searching for SEVEN DEADLY SINS one user tried each of the sins separately, and most of them are not in the index.)
8.7.1 Misspellings and miskeyings

The set EXPALL was scanned for misspellings, mainly by looking for occurrences of 'CAN'T FIND "<word>"' in the logs (indicated in the log by "<word> CF"). Candidate words were classified as normal misspellings or miskeyings (CONTEMPORY), words run together (2000AD, ANDPHOTOGRAPHY), rubbish (UKYIYUY) and dubious (HIST, 5H5P0C, WEDGEWOOD). The last category contained words which looked like plausible abbreviations, acronyms or personal names. There were 60 words in the first category (normal misspellings). Two of them (affect and woking) were misspellings which make real words. The system treated the others as shown in Table 8.8.

Although there were none in the set used for Table 8.8 it is possible to find misspellings for which the system suggests the wrong correction (other than mistakes in the dictionary). Before we tightened the matching criteria (8.3.1) we had prosial --> parochial and poletics --> politische. Despite trying to prevent non-English titles from contributing to the dictionary there is still quite a proportion of foreign words. With the procedure as it is at the moment a good example of this type of erroneous "correction" is Thacher --> teacher. Thatcher gets the same score as teacher as a candidate replacement; teacher is offered because it is shorter (Appendix 2). If this situation were at all frequent it would suggest that the user should, when necessary, be offered a choice of replacements.
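The Thacher example shows how the tie-break decides the offer. A minimal sketch of such a selection rule, assuming candidates arrive with a similarity score (higher is better); the scores below are invented, and the scoring procedure itself is the one described in Appendix 2, which this sketch does not reproduce.

```python
def choose_replacement(candidates):
    """Pick the highest-scoring candidate; on a tie prefer the shorter word.

    `candidates` maps each dictionary word to a hypothetical score.
    Returns None when there are no candidates.
    """
    if not candidates:
        return None
    return min(candidates, key=lambda word: (-candidates[word], len(word), word))


def tied_candidates(candidates):
    """Return every candidate sharing the top score, shortest first."""
    if not candidates:
        return []
    top = max(candidates.values())
    return sorted((w for w, s in candidates.items() if s == top),
                  key=lambda w: (len(w), w))


scores = {"teacher": 5, "thatcher": 5, "thrasher": 3}  # invented scores
print(choose_replacement(scores))
print(tied_candidates(scores))
```

Offering the whole tied list, rather than only its shortest member, is one way of giving the user the choice of replacements that the paragraph above contemplates.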


Table 8.8

[Table 8.8 is largely illegible in this copy. The row headings which can be made out are: system made no suggestion (23 words, note 1); effectively corrected by strong stemming; corrected by spelling standardisation; system suggested correction to a misspelling in the dictionary; word found as a misspelling in the source file; wrongly corrected by strong stemming; system suggested a correction which was wrong. Notes (1) and (2) list the words concerned; most of them cannot be read.]

Every target word except brimstone was in the dictionary. Of the words for which no correction was suggested, 11 either have multiple errors or are incorrect in the first letter; it would not be reasonable to expect machine correction. Workw is too short. Poletics should be corrected; there may be a fault in the dictionary in this region or a bug in the encoding procedure. One feels that it ought to be possible to correct the remaining nine words, with their single omission, insertion or substitution. Four of them (conflice, employmnt, performsnce, philosopht) would be corrected if the encoding was truncated at four characters in the manner of the original Soundex. The sample is too small to draw very firm conclusions, but some preliminary analysis of a much larger set agrees with the results in Table 8.8. It suggests that rather more than half the misspellings will be properly corrected.
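The effect of truncating the code at four characters can be tried with the classic Soundex scheme. This is a sketch only: the report's own encoding procedure is not reproduced here, so standard Soundex stands in for it, and the `length=None` option (which returns the untruncated code) is an illustrative addition.

```python
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word, length=4):
    """Classic Soundex: first letter plus digits, cut or padded to `length`.

    With length=None the full, untruncated code is returned.
    """
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":          # h and w do not separate equal codes
            previous = digit
    if length is None:
        return code
    return code[:length].ljust(length, "0")


pairs = [("conflice", "conflict"), ("employmnt", "employment"),
         ("performsnce", "performance"), ("philosopht", "philosophy")]
for wrong, right in pairs:
    print(wrong, soundex(wrong), soundex(right),
          soundex(wrong, None) == soundex(right, None))
```

Under this scheme all four pairs share a four-character code although their untruncated codes differ, which is consistent with the observation above.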

8.7.2 Legitimate words which are not in the file

It is very important that the system should not suggest a replacement unless there is a high probability that the suggestion is right. It is particularly important that the system should not suggest replacements for good words which are not in the file (or any associated thesaurus). There were eight such words in the set of EXPALL searches.

Table 8.9

Legitimate words which were not in the file

Word              Suggested replacement

stupidity         none
selfless          sleepless
unselfish         none
selflessness      none
truancy           none
gymnastics        none
brimstone         brainstem
VS M              none

There were also five cases of words being run together (2000AD, FINANCIALACCOUNTING, ...). The system didn't suggest a replacement for any of these. Since replacements are offered in a very neutral way

CAN'T FIND 'selfless' - nearest match found is 'sleepless'

these rare occurrences are probably fairly harmless and may amuse the searcher.


8.7.3 The effect of stemming on spelling correction

Both weak and strong stemming interact with the spelling correction procedure, because the removal of a suffix from a misspelling occasionally maps it to a valid stem. There are four examples in Table 8.8 above. If the strong stem but not the weak stem of a word is found there is a 'CAN'T FIND' message, but the user is given no choice. This only applies to EXP.

CAN'T FIND 'narative' - 1 book under similar word(s)

The book found was indexed under the Swedish word "nar". A quick look at a much larger set of about 3000 searches of EXP found six occurrences of strong stemmed misspellings matching something in the index. Of these, two worked well and four badly. HOBBS finds (THOMAS) HOBBES as intended and the rather dubious but possibly not incorrect word CITATOR finds CITATION(S). The bad ones are INTERGRATION which finds two occurrences of INTERGRATED in the file, COMPARTIVE which finds COMPARTMENT(S), LAEW (for LAW) which finds LEWES and CAPITALALISM which finds derivatives of Italian CAPITALE but doesn't find CAPITALISM or CAPITAL which both strong stem to CAPIT. (LAEW finding LEWES is a consequence of mapping "ae" to "e" in the spelling standardisation.)

We can guess that what little effect strong stemming has on the treatment of misspellings is, on balance, harmful. However, it does not seem to conflate misspellings with valid words often enough for this effect to be serious.
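The lookup order implied here (weak stem first, then strong stem, then the spelling corrector) can be sketched as follows. Everything in the sketch is a stand-in: the two suffix-stripping functions are toy approximations invented for illustration, not the stemmers used in EXP, and the index is a three-entry set of stems.

```python
def weak_stem(word):
    """Toy weak stemming: remove a plural ending only."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


STRONG_SUFFIXES = ("ative", "ation", "ism", "al")  # illustrative only


def strong_stem(word):
    """Toy strong stemming: weak stem, then strip one longer suffix."""
    stem = weak_stem(word)
    for suffix in STRONG_SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
            return stem[:-len(suffix)]
    return stem


def look_up(word, index):
    """Return (outcome, stem) for one search word."""
    weak = weak_stem(word)
    if weak in index:
        return ("found", weak)
    strong = strong_stem(word)
    if strong in index:
        # The strong stem is substituted without offering the user a choice
        return ("found under similar word(s)", strong)
    return ("pass to spelling correction", None)


index = {"narrative", "nar", "law"}  # invented index of stems
print(look_up("narrative", index))
print(look_up("narative", index))
print(look_up("selfless", index))
```

In this toy setting the misspelling "narative" never reaches the spelling corrector, because its strong stem happens to coincide with an unrelated index entry.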
8.7.4 User response to 'CAN'T FIND' messages

"CAN'T FIND" MESSAGE WITH SUGGESTED REPLACEMENT

Reaction is "good" if the user accepts a correct replacement offer or rejects an incorrect offer, otherwise "bad". Of 29 suggested replacements (these include prosial --> parochial and a few others which occurred before the matching criteria were tightened), users' response was good in 21 cases and bad in the remaining 8 cases. Most of the unsatisfactory responses consisted of the acceptance of dictionary misspellings (researach --> reasearch). These are usually common and plausible misspellings, so users' acceptance is not surprising. If the dictionary were more accurate it is likely that most responses would be satisfactory. Three of the eight "bad" responses, where the user rejected a correct suggestion, did not affect the search: these searchers used the blue key to enter their own replacement and did so correctly.


"CAN'T FIND" MESSAGES WITHOUT SUGGESTED REPLACEMENT

We have not done a separate analysis of user reaction to the dialogue which offers a choice between typing a replacement word and instructing the system to ignore the word. This appears in the CTL system (Figs 7.5 and 7.6) whenever a word is not found, and in EXP when the matching procedure cannot find anything close enough. "Good" responses include correcting a misspelling, typing a related word or words (2000AD was replaced by TWENTY FIRST CENTURY, GYMNASTICS by DIVING), and starting another search if the word was correct and vital to the success of the search (SCORSESE). "Bad" responses include those where the user instructs the system to ignore a word although it is important to the meaning of the search (STUPIDITY in THE POLITICS AND SOCIOLOGY OF STUPIDITY), and those where the user replaces one misspelling with another (PSYCOPAPHY by DELINGQUENCY). Neutral responses, inefficient but harmless, are sometimes made by good typists who use the red key to abort the search and then re-enter it. A majority of users seem to take the most efficient action, but Table 8.10 suggests that a higher proportion of "CAN'T FIND"s are successfully tackled if the system can suggest a spelling correction.

8.7.5 Is spelling correction worth while?

If spelling correction is no more than a gimmick it may not be worth its space and processing requirements. Since it can result in "correction" to an unintended word, it may even cause some searches to fail which would have succeeded in a system where the onus is on the user to retype the word. (Although the samples used contain few of these spurious replacements, a quick look at a much larger sample suggests that they are not particularly rare.)

We tested the hypothesis that there is no difference in the quality of users' replacements of CAN'T FIND terms between EXP and CTL. We isolated every occurrence of "CAN'T FIND" from EXPALL and CTLALL, excluding searches (EXP system) where the replacement was automatic (weak stem not found but strong stem found). We then excluded searches in which a dictionary misspelling was offered as the replacement word (contempory, researach, etc). There remained 109 occurrences.

"Good" cases are those in which the user typed a sensible replacement, accepted a sensible system suggestion or aborted a search where this was the most rational action. "Bad" cases are those in which we judged the replacement word accepted or typed by the user to be inappropriate, or in which the user "wrongly" aborted the search.

Table 8.10

Response to "CAN'T FIND" by system

Response      EXP           CTL           Total

Good          57 (78%)      23 (64%)      80 (73%)
Bad           16 (22%)      13 (36%)      29 (27%)

Total         73 (67%)      36 (33%)     109
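A rough check on the figures in Table 8.10 is to put a confidence interval round the difference between the two "good" proportions. The sketch below uses the usual normal approximation with a multiplier of 1.96 for about 95% coverage; it is an illustrative check with an invented function name, not the test the authors report having used.

```python
import math


def difference_interval(good_a, total_a, good_b, total_b, z=1.96):
    """Normal-approximation interval for p_a - p_b (z=1.96 gives about 95%)."""
    p_a = good_a / total_a
    p_b = good_b / total_b
    std_err = math.sqrt(p_a * (1 - p_a) / total_a + p_b * (1 - p_b) / total_b)
    diff = p_a - p_b
    return diff, diff - z * std_err, diff + z * std_err


# Table 8.10: "good" responses out of all responses, EXP then CTL
diff, low, high = difference_interval(57, 73, 23, 36)
print(f"difference = {diff:.3f}, interval = ({low:.3f}, {high:.3f})")
```

The interval runs from just below zero to roughly a third, so it includes zero: EXP looks better, but a sample of 109 occurrences does not exclude there being no difference.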

These figures suggest that EXP is better than CTL. They are unlikely to be due to chance, but the sample is not large enough to allow us to reject the hypothesis that there is no difference between the systems. The analysis needs to be repeated using a larger sample of searches. It may also be that searches where the user accepts a system-suggested replacement are quicker and felt to be less stressful than searches where the user has to type a replacement. A time analysis could be done on our data, but measurement of perceived ease of use would need a large number of interviews. (Many of our users do not appear to mind how long they spend at the catalogue, provided that something seems to be happening.)

8.8 Use of the go/see list

Of the 1087 searches in EXPALL and CTLALL combined, 268 (24.7%) contained a word or phrase which EXP would retrieve as an entry in the go/see list. Table 8.11 is a list of the 72 go/see entries which were used. The full list is given in Appendix 5. The high proportion of searches containing a go/see entry shows that choice of entries matches our users' search vocabulary. But the evidence as to whether searches containing a go/see entry perform better on EXP is rather circumstantial.


Table 8.11

List of go/see entries used in the searches

19th
20th
Advertising
African, Africa
America, American
BBC
Brecht
Children
Chile
Chinese, China
Company
Conservative party
Cuban
Developing country, third world
EEC
English, England
European, Europe
first world war, world war 1
France, French
German, Germany
Hegel
Holland
India
Industrial relations
Industrial revolution
Iraq
Italy
Japanese, Japan
Keynes
Korea, Korean
Man, men
Marxist, Marx
Matrices, matrix
Micro electronics, microelectronics
middle class
Movies
Social science
Soviet, soviet russia, russian
Taxation
television, tv
United Kingdom, Britain, Great Britain, UK, GB
United States, US
Vienna
Welfare State
Wives
Women
World war 2, world war ii

Table 8.7 (repeated initial searches) shows that of 13 initial searches which did better on EXP than CTL, 10 worked better because they contained go/see entries. When repeating searches we did not find any case where the retrieval of a go/see entry was detrimental. (1) More searches need to be examined before we can reach a conclusion.

(1) There was only one search (not in Table 8.7) where a go/see phrase was a potential source of false drops. This was a search for 'Less developed countries'. 'Developing countries' is in the list, where it is equivalenced to 'Underdeveloped countries' etc. Since the list is stored with its individual words weak stemmed it cannot distinguish between 'developing countries' and 'developed countries'. Hence 'Less developed countries' returns from the index lookup with 'less' and 'developing countries [etc]'. As it happens the search still behaves almost identically on the two systems, finding eight records with 'less developed countries' in their titles.
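The behaviour described in this footnote follows from matching weak-stemmed words against weak-stemmed list entries. A minimal sketch, assuming a toy suffix stripper and a three-entry go/see list keyed on stemmed words; both are illustrative inventions and far cruder than the list in Appendix 5.

```python
def weak_stem(word):
    """Toy weak stemmer: handles -ies, -ing, -ed and plural -s only."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


# Hypothetical go/see entries: stemmed phrase -> preferred phrase
GO_SEE = {
    ("develop", "country"): "developing countries",
    ("underdevelop", "country"): "developing countries",
    ("third", "world"): "developing countries",
}
MAX_PHRASE = max(len(key) for key in GO_SEE)


def parse_search(search):
    """Split a search into terms, treating a go/see phrase as one term."""
    words = search.lower().split()
    stems = [weak_stem(w) for w in words]
    terms, i = [], 0
    while i < len(words):
        # Try the longest possible phrase first, then shorter ones
        for n in range(min(MAX_PHRASE, len(words) - i), 0, -1):
            key = tuple(stems[i:i + n])
            if key in GO_SEE:
                terms.append(GO_SEE[key])
                i += n
                break
        else:
            terms.append(words[i])
            i += 1
    return terms


print(parse_search("Underdeveloped countries"))
print(parse_search("Less developed countries"))
```

The first search comes back as a single term, and the second as 'less' plus the 'developing countries' entry, because "developed" and "developing" share a weak stem.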


References

1 SIEGEL E R and others. A comparative evaluation of the technical performance and user acceptance of two prototype online catalog systems. Information Technology and Libraries 3 (1), March 1984, 35-46.
2 MARKEY K and DEMEYER A N. Dewey Decimal Classification Online Project : evaluation of a library schedule and index integrated into the subject searching capabilities of an online catalogue. Final report to the Council on Library Resources. OCLC Online Computer Library Center, 1986.
3 MITEV N N, VENNER G M and WALKER S. Designing an online public access catalogue : Okapi, a catalogue on a local area network. (Library and Information Research Report 39). London : British Library, 1985.
