8 Evaluation

8.1 Objects of the evaluation

Before planning the evaluation we drew up the following list of questions to which we hoped to find answers. Conclusions on each of them are given in Chapter 9.

Stemming and spelling standardisation (6.2)

1  Does it significantly increase recall? If so, for what types of search? In particular, how often do stemmed searches succeed where they would fail without stemming?

2  Does stemming significantly decrease precision or lead to false drops?

3  How does the use of both strong and weak stemming (EXP system) compare with weak stemming only (CTL system)? For example one might find that there are, on average, fewer rephrasings of searches on EXP than on CTL.

4  Does the EXP system's two-level merge (6.5) make any difference (except to decrease search speed)?

5  Is there a case for using strong stemming only? If so, should this apply to all searches, or only to those containing more than a certain number (two, say) of terms?

Spelling correction (6.4)

6  How effective is EXP's semi-automatic correction procedure? How does it compare with users' response to CTL's 'CAN'T FIND' message (Figs 7.5 and 7.6)?

The go/see list (6.3)

7  How often does it make any difference? Does our list contain appropriate entries? How should one compile such a list for a given environment?

8  Does the list lead to false drops ('us' [pronoun] = 'United States')?

9  Should there be more than one type of object in the list (e.g. see alsos as well as sees)?

Users' perception of and behaviour with the system

10 What sort of conceptual models do users have of the catalogue? (How do they think it works? Is it [...]

[...]

(1) It is important and not always easy to distinguish between specific (title) searches and subject searches. Many subject search statements look like titles, and (without asking the user) it is only possible to make this distinction after looking at other searches in the session, and at the relationship between the apparent relevance of the books retrieved and the time which the user spent looking at record displays. Display of a book with the exact title followed by end of session is good evidence that this was a title search. When in doubt, we classified the search as subject.

There were also the usual searches consisting of obscenities or fooling around, and a few which we had to classify as "rubbish". (Interestingly, there were no insults aimed at the catalogue.)

We also made an attempt to classify the language of searches with regard to its "appropriateness". This is rather subjective and we have made no use of this classification in the evaluation apart from code M, which was used to denote searches spoilt by uncorrected mistakes ("The affect of working women on consumer behaviour").

RELATIONSHIP BETWEEN SUCCESSIVE SEARCHES IN A SESSION

Within a session, each search was classified as a repeat of the previous one, related to the previous one (in the same session), unrelated or indeterminate.
When a search is related to the previous one we recorded the type of relationship as broader, synonymous, narrower or other relationship.

NUMBER OF TERMS IN A SEARCH

It is to be expected that the effects of stemming will be more marked when there are more than two or three terms in a search statement, so we recorded the number of terms in each search. The phrases recognised by the EXP system were counted as a single term (see next paragraph), so the same search statement could have a different term count depending on which system it was submitted to. Stop words were not counted.

The number of terms was defined to be the number of items displayed on the "searching" screen (Fig 7.4) after any substitutions or deletions of words which were not found. Thus "film editting in great britain" contains three terms on the EXP system if "editting" is found or corrected (or two terms if it is ignored), but it contains four (or three) terms on the CTL system because "great britain" is two words. (A search in which none of the words is found, and the user instructs the system to ignore them, contains no terms. There were a handful of these empty searches, and they were excluded from the statistical evaluation.)

8.3.7 Description of the SRCHES file

Each record in the file contained the following fields:

 1  session number
 2  search number
 3  date and time
 4  system (E = EXP or C = CTL)
 5  whether observed (O or N)
 6  number of terms (defined above)
 7  search type (defined above)
 8  appropriateness of terminology (discussed above)
 9  search result
      N = books found
      0 = no hits
      R = user aborted with red key
      X = user aborted with black (end session) key
10  number of postings with maximum weight (NMPW) (i.e. the number of hits on an implied AND)
11  number of postings with "good" weight (NGW)
12  total number of postings (NRW)
13  user action following display of search result
      G = green key - look at records
      B = blue key - return to input screen to alter current search
      R = red key - return to clear input screen
      X = black key - end session
      T = system left to time out
14  number of records displayed by the user
15  time spent looking at records
16  relationship to previous search
      F = first in physical session
      R = related
      I = identical
      E = equivalent
      U = unrelated
      O = other or indeterminate
17  (if related) type of relationship
      B = broader
      N = narrower
      S = synonymous
      O = other (sideways relationship)

After "rubbish", "fooling" and "empty" searches had been excluded the SRCHES file contained records for a total of 1087 searches. This was partitioned into the sets EXPALL (603 searches on EXP) and CTLALL (484 searches on CTL). The F (first) and U (unrelated to previous) searches were extracted from CTLALL to form a set of 255 initial searches. Thus CTLALL was regarded as containing 255 sessions comprising 484 searches.

8.4 Analysis of observation and interview data

8.4.1 Success rate reported by users

Answers to the question "Did you find what you were looking for?" were recorded as "yes", "probably, but need to check the shelves to make sure" and "no". There were no other responses. In the "no" case, users were asked the supplementary question as to whether the computer had "found anything useful".
Some of the sessions contained more than one search topic or group of topics, and two subjects answered both "yes" and "no" - meaning that their session had included both successful and unsuccessful searches. Table 8.1 summarises the results.

Table 8.1  Success rate for observed sessions by system

                                        CTL system   EXP system   Total
Successful                              49           42            91
Probably successful                      6            8            14
Subtotal                                55 (87.3%)   50 (84.7%)   105 (86.1%)

Unsuccessful, but useful books found     2            3             5
Unsuccessful                             6            6            12
Subtotal                                 8 (12.7%)    9 (15.3%)    17 (13.9%)

Total                                   63           59           122

There is no significant difference between the session success rates on the two systems. The failure rate is too low for it to be worth tabulating previous online catalogue experience against success/failure.

8.4.2 Brief analysis of the 17 "failure" sessions

A transcript of the detailed report is given as Appendix 4. All but two of the 17 sessions contained more than one search. The searches given in the following analysis have been chosen as being representative. One session is omitted because it appears to consist of searches for specific titles.

1  Not in the catalogue (two sessions)

"HMSO employment statistics". This appears to work quite well but user wanted 1986. This might be counted a specific item search.

"Acquired immune deficiency syndrome". This finds two false drops offered as "2 books found, but they don't match your search very well". User had tried "aids", and looked at the first 12 of 302 books found.

2  User's language doesn't match index language (seven sessions)

These searches are reasonably comprehensible to a human, but not to the catalogue.

"Generic social work"

"A definition of social work"

"Employment structure"

"Passing of laws"

"Recent changes in Londons economy". This was the only search of the session. None of the 14 records was "good" - they all contained "recent" and one of the other words.
"London's economy" finds eight books, one of which appears very good.

"Truancy". Unfortunately this does not stem to "truant", which gives one good record. User tried "School absenteeism" and "Absenteeism".

"Sociology of shopping". This user then tried "Shopping", looking at 50 of the 149 books, followed by "Anthropology of shopping".

The indexi...

[...] in the logs (indicated in the log by "(word) CF").

Candidate words were classified as normal misspellings or miskeyings (CONTEMPORY), words run together (2000AD, ANDPHOTOGRAPHY), rubbish (UKYIYUY) and dubious (HIST, 5H5P0C, WEDGEWOOD). The last category contained words which looked like plausible abbreviations, acronyms or personal names.

There were 60 words in the first category (normal misspellings). Two of them ('affect' and 'woking') were misspellings which make real words. The system treated the others as shown in Table 8.8.

Although there were none in the set used for Table 8.8 it is possible to find misspellings for which the system suggests the wrong correction (other than mistakes in the dictionary). Before we tightened the matching criteria (8.3.1) we had prosial --> parochial and poletics --> politische. Despite trying to prevent non-English titles from contributing to the dictionary there is still quite a proportion of foreign words. With the procedure as it is at the moment a good example of this type of erroneous "correction" is Thacher --> teacher. Thatcher gets the same score as teacher as a candidate replacement; teacher is offered because it is shorter (Appendix 2). If this situation were at all frequent it would suggest that the user should, when necessary, be offered a choice of replacements.
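
The Thacher example can be illustrated with a small sketch. It is not the EXP matching procedure (which is described in Appendix 2): the Candidate type, the numeric scores and the function names are hypothetical, and the only behaviour taken from the text is that equal-scoring candidates are currently separated by length, with the shorter word offered. The sketch shows both the single-suggestion behaviour and the alternative of returning every tied candidate so that the user could be offered a choice.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    word: str
    score: int  # hypothetical similarity score; higher means a closer match


def best_candidates(candidates: List[Candidate]) -> List[str]:
    """Return every word sharing the top score, shortest first."""
    if not candidates:
        return []
    top = max(c.score for c in candidates)
    tied = [c.word for c in candidates if c.score == top]
    # Shorter words come first; alphabetical order breaks any remaining tie.
    return sorted(tied, key=lambda w: (len(w), w))


def single_suggestion(candidates: List[Candidate]) -> Optional[str]:
    """Offer one word only: the shortest of the top-scoring candidates."""
    tied = best_candidates(candidates)
    return tied[0] if tied else None


# Invented scores for the misspelling "thacher"; only the tie between
# "thatcher" and "teacher" reflects the situation described in the text.
candidates = [
    Candidate("thatcher", 5),
    Candidate("teacher", 5),
    Candidate("thicker", 3),
]

print(single_suggestion(candidates))  # the single offer, chosen by length
print(best_candidates(candidates))    # the list a choice dialogue could show
```

Returning the whole tied list costs nothing extra in the lookup, so whether to show a choice becomes purely a question of screen design.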

Table 8.8  System treatment of the normal misspellings
[Table body illegible in the source. The legible row headings are: System made no suggestion (23); Effectively corrected by strong stemming; Corrected by spelling standardisation; System suggested correction to a misspelling in the dictionary; Word found as misspelling in the source file; Wrongly corrected by strong stemming; System suggested a correction which was wrong.]

Every target word except brimstone was in the dictionary. Of the words for which no correction was suggested, 11 either have multiple errors or are incorrect in the first letter; it would not be reasonable to expect machine correction. Workw is too short. Poletics should be corrected; there may be a fault in the dictionary in this region or a bug in the encoding procedure. One feels that it ought to be possible to correct the remaining nine words, with their single omission, insertion or substitution. Four of them (conflice, employmnt, performsnce, philosopht) would be corrected if the encoding was truncated at four characters in the manner of the original Soundex.

The sample is too small to draw very firm conclusions, but some preliminary analysis of a much larger set agrees with the results in Table 8.8.
It suggests that rather more than half the misspellings will be properly corrected.

8.7.2 Legitimate words which are not in the file

It is very important that the system should not suggest a replacement unless there is a high probability that the suggestion is right. It is particularly important that the system should not suggest replacements for good words which are not in the file (or any associated thesaurus). There were eight such words in the set of EXPALL searches.

Table 8.9  Legitimate words which were not in the file

Word            Suggested replacement
stupidity       none
selfless        sleepless
unselfish       none
selflessness    none
truancy         none
gymnastics      none
brimstone       brainstem
VS M            none

There were also five cases of words being run together (2000AD, FINANCIALACCOUNTING, ...). The system didn't suggest a replacement for any of these.

Since replacements are offered in a very neutral way (CAN'T FIND 'selfless' - nearest match found is 'sleepless') these rare occurrences are probably fairly harmless and may amuse the searcher.

8.7.3 The effect of stemming on spelling correction

Both weak and strong stemming interact with the spelling correction procedure, because the removal of a suffix from a misspelling occasionally maps it to a valid stem. There are four examples in Table 8.6 above.

If the strong stem but not the weak stem of a word is found there is a 'CAN'T FIND' message, but the user is given no choice. This only applies to EXP.

    CAN'T FIND 'narative' - 1 book under similar word(s)

The book found was indexed under the Swedish word "nar".

A quick look at a much larger set of about 3000 searches of EXP found six occurrences of strong stemmed misspellings matching something in the index.
Of these, two worked well and four badly. HOBBS finds (THOMAS) HOBBES as intended and the rather dubious but possibly not incorrect word CITATOR finds CITATION(S). The bad ones are INTERGRATION, which finds two occurrences of INTERGRATED in the file; COMPARTIVE, which finds COMPARTMENT(S); LAEW (for LAW), which finds LEWES; and CAPITALALISM, which finds derivatives of Italian CAPITALE but doesn't find CAPITALISM or CAPITAL, which both strong stem to CAPIT. (LAEW finding LEWES is a consequence of mapping "ae" to "e" in the spelling standardisation.)

We can guess that what little effect strong stemming has on the treatment of misspellings is, on balance, harmful. However, it does not seem to conflate misspellings with valid words often enough for this effect to be serious.

8.7.4 User response to 'CAN'T FIND' messages

"CAN'T FIND" MESSAGE WITH SUGGESTED REPLACEMENT

Reaction is "good" if the user accepts a correct replacement offer or rejects an incorrect offer, otherwise "bad".

Of 29 suggested replacements (these include prosial --> parochial and a few others which occurred before the matching criteria were tightened), users' response was good in 21 cases and bad in the remaining 8 cases. Most of the unsatisfactory responses consisted of the acceptance of dictionary misspellings (researach --> reasearch). These are usually common and plausible misspellings, so users' acceptance is not surprising. If the dictionary were more accurate it is likely that most responses would be satisfactory. Three of the eight "bad" responses, where the user rejected a correct suggestion, did not affect the search: these searchers used the blue key to enter their own replacement and did so correctly.
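
The good/bad rule stated at the start of this subsection can be written as a single comparison. The sketch below is only an illustration of that rule; the function name and the two boolean inputs are assumptions, and in the study the judgement of whether a suggestion was "correct" was made by hand from the logs.

```python
def reaction(suggestion_is_correct: bool, user_accepted: bool) -> str:
    """Classify a response to a suggested replacement.

    'good' = a correct offer accepted, or an incorrect offer rejected;
    anything else is 'bad'.
    """
    return "good" if suggestion_is_correct == user_accepted else "bad"


# The four possible combinations, printed as a small truth table.
for correct in (True, False):
    for accepted in (True, False):
        print(f"correct={correct!s:5} accepted={accepted!s:5} -> "
              f"{reaction(correct, accepted)}")
```

Under this rule, accepting a dictionary misspelling counts as "bad" even though the user behaved sensibly, and rejecting a correct suggestion counts as "bad" even when the user then types the right word.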

"CAN'T FIND" MESSAGES WITHOUT SUGGESTED REPLACEMENT

We have not done a separate analysis of user reaction to the dialogue which offers a choice between typing a replacement word and instructing the system to ignore the word. This appears in the CTL system (Figs 7.5 and 7.6) whenever a word is not found, and in EXP when the matching procedure cannot find anything close enough.

"Good" responses include correcting a misspelling, typing a related word or words (2000AD was replaced by TWENTY FIRST CENTURY, GYMNASTICS by DIVING), and starting another search if the word was correct and vital to the success of the search (SCORSESE).

"Bad" responses include those where the user instructs the system to ignore a word although it is important to the meaning of the search (STUPIDITY in THE POLITICS AND SOCIOLOGY OF STUPIDITY), and those where the user replaces one misspelling with another (PSYCOPAPHY by DELINGQUENCY).

Neutral responses, inefficient but harmless, are sometimes made by good typists who use the red key to abort the search and then re-enter it.

A majority of users seem to take the most efficient action, but Table 8.10 suggests that a higher proportion of "CAN'T FINDS" are successfully tackled if the system can suggest a spelling correction.

8.7.5 Is spelling correction worth while?

If spelling correction is no more than a gimmick it may not be worth its space and processing requirements.
Since it can result in "correction" to an unintended word, it may even cause some searches to fail which would have succeeded in a system where the onus is on the user to retype the word. (Although the samples used contain few of these spurious replacements, a quick look at a much larger sample suggests that they are not particularly rare.)

We tested the hypothesis that there is no difference in the quality of users' replacements of CAN'T FIND terms between EXP and CTL. We isolated every occurrence of "CAN'T FIND" from EXPALL and CTLALL, excluding searches (EXP system) where the replacement was automatic (weak stem not found but strong stem found). We then excluded searches in which a dictionary misspelling was offered as the replacement word (contempory, researach, etc). There remained 109 occurrences.

"Good" cases are those in which the user typed a sensible replacement, accepted a sensible system suggestion or aborted a search where this was the most rational action.

"Bad" cases are those in which we judged the replacement word accepted or typed by the user to be inappropriate, or in which the user "wrongly" aborted the search.

Table 8.10  Response to "CAN'T FIND" by system

System     Good        Bad         Total
EXP        57 (78%)    16 (22%)     73 (67%)
CTL        23 (64%)    13 (36%)     36 (33%)
Total      80 (73%)    29 (27%)    109

These figures suggest that EXP is better than CTL. They are unlikely to be due to chance, but the sample is not large enough to allow us to reject the hypothesis that there is no difference between the systems.
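
The figures in Table 8.10 can be checked with a short calculation. The report does not say which test was applied, so the sketch below assumes a Pearson chi-square test on the 2 x 2 table without continuity correction; the function names are illustrative. Only the four cell counts come from the table.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def p_value_1df(chi2: float) -> float:
    """Upper-tail probability for a chi-square variate with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Cell counts from Table 8.10: EXP good/bad, then CTL good/bad.
chi2 = chi_square_2x2(57, 16, 23, 13)
p = p_value_1df(chi2)

CRITICAL_5_PERCENT = 3.841  # standard 5% critical value for one degree of freedom
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
print("reject at 5%" if chi2 > CRITICAL_5_PERCENT else "cannot reject at 5%")
```

Under these assumptions the statistic comes to about 2.49, below the 5% critical value of 3.841, which agrees with the statement that the hypothesis of no difference cannot be rejected on this sample.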
The analysis needs to be repeated using a larger sample of searches.

It may also be that searches where the user accepts a system-suggested replacement are quicker and felt to be less stressful than searches where the user has to type a replacement. A time analysis could be done on our data, but measurement of perceived ease of use would need a large number of interviews. (Many of our users do not appear to mind how long they spend at the catalogue, provided that something seems to be happening.)

8.8 Use of the go/see list

Of the 1087 searches in EXPALL and CTLALL combined, 268 (24.6%) contained a word or phrase which EXP would retrieve as an entry in the go/see list. Table 8.11 is a list of the 72 go/see entries which were used. The full list is given in Appendix 5.

The high proportion of searches containing a go/see entry shows that the choice of entries matches our users' search vocabulary. But the evidence as to whether searches containing a go/see entry perform better on EXP is rather circumstantial.

Table 8.11  List of go/see entries used in the searches

19th
20th
Advertising
African, Africa
America, American
BBC
Brecht
Children
Chile
Chinese, China
Company
Conservative party
Cuban
Developing country, third world
EEC
English, England
European, Europe
First world war, world war 1
France, French
German, Germany
Hegel
Holland
India
Industrial relations
Industrial revolution
Iraq
Italy
Japanese, Japan
Keynes
Korea, Korean
Man, men
Marxist, Marx
Matrices, matrix
Micro electronics, microelectronics
Middle class
Movies
Social science
Soviet, soviet russia, russian
Taxation
Television, tv
United Kingdom, Britain, Great Britain, UK, GB
United States, US
Vienna
Welfare State
Wives
Women
World war 2, world war ii

Table 8.7 (repeated initial searches) shows that of 13 initial searches which did better on EXP than CTL, 10 worked better because they contained go/see entries.
When repeating searches we did not find any case where the retrieval of a go/see entry was detrimental (1). More searches need to be examined before we can reach a conclusion.

(1) There was only one search (not in Table 8.7) where a go/see phrase was a potential source of false drops. This was a search for 'Less developed countries'. 'Developing countries' is in the list, where it is equivalenced to 'Underdeveloped countries' etc. Since the list is stored with its individual words weak stemmed it cannot distinguish between 'developing countries' and 'developed countries'. Hence 'Less developed countries' returns from the index lookup with 'less' and 'developing countries [etc]'. As it happens the search still behaves almost identically on the two systems, finding eight records with 'less developed countries' in their titles.

References

1  SIEGEL E R and others. A comparative evaluation of the technical performance and user acceptance of two prototype online catalog systems. Information Technology and Libraries 3 (1), March 1984, 35-46.

2  MARKEY K and DEMEYER A N. Dewey Decimal Classification Online Project : evaluation of a library schedule and index integrated into the subject searching capabilities of an online catalogue. Final report to the Council on Library Resources. OCLC Online Computer Library Center, 1986.

3  MITEV N N, VENNER G M and WALKER S. Designing an online public access catalogue : Okapi, a catalogue on a local area network. (Library and Information Research Report 39). London : British Library, 1985.