5 ANALYSIS & RESULTS

5.1 Introduction

This chapter contains a general comparison of the three search systems, based on statistical analysis of the transaction logs and on user comments and responses to questions. This is followed by a more detailed discussion of results relating to use of the query expansion and classification browsing options. The results of the external assessment of subjects' lists of chosen records are then given. Finally there is a summary of users' comments on the systems.

5.1.1 Terminology

For brevity, some terms will be used in a specific and sometimes non-standard sense in this chapter. A search is the interaction of a user with the system starting with and resulting from the input of a single search statement. A topic is one of the questions given in Appendix 4. A topic-session is one or more consecutive searches by a single user aiming to retrieve records on a single topic. A session is a sequence of one or more topic-sessions by the same user on the same system, finishing when the user finishes or changes to another system. The three systems, which were described in Chapter 3, are often referred to as D (the dumb system), Q (the system which allows query expansion by use of the "More" option but does not offer classmark browsing) and F (the full system, allowing both query expansion and classification browsing). Sets of records retrieved by the systems are referred to as lists, to reflect the fact that they are rather dissimilar from the unordered sets retrieved by most conventional boolean retrieval systems.
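
The nesting of these terms can be illustrated with a short sketch. It is not the analysis program used in the experiment: the `Search` record, its field names and the topic references such as "T3" are hypothetical, and the sketch simply assumes that the log yields searches in chronological order.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Search:
    user: int       # subject number (hypothetical field)
    system: str     # "D", "Q" or "F"
    topic: str      # topic reference, e.g. "T3" (hypothetical format)
    statement: str  # the single search statement that started the search


def topic_sessions(searches):
    """Group consecutive searches by the same user, on the same system, on one topic."""
    return [list(g) for _, g in groupby(searches, key=lambda s: (s.user, s.system, s.topic))]


def sessions(searches):
    """Group consecutive topic-sessions by the same user on the same system."""
    return [
        list(g)
        for _, g in groupby(topic_sessions(searches), key=lambda ts: (ts[0].user, ts[0].system))
    ]


# Invented log extract: one subject, two topics on D, then one topic on Q.
log = [
    Search(1, "D", "T3", "slump 1932"),
    Search(1, "D", "T3", "depression 1930s"),
    Search(1, "D", "T7", "welfare economics"),
    Search(1, "Q", "T9", "ion transport"),
]
print(len(topic_sessions(log)))  # 3 topic-sessions
print(len(sessions(log)))        # 2 sessions
```

Grouping only consecutive items mirrors the definitions above: a session ends when the user changes system, and a topic-session ends when the user moves to another topic.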

5.1.2 Source and processing of data

Statistics on numbers of searches and sessions, on records retrieved, seen and chosen and their source, and on timings, were obtained from automatic analysis of the transaction logs generated by the search programs while they were in use during the experiment described in Chapter 4. A computer program was written to obtain this data from the logs. There is an example transaction log in Appendix 5. There was some manual editing of logs to insert subject numbers and the reference numbers of the topics which were being searched (Appendix 4). In a very few cases it was not possible for the experimenters to be certain of the topic, and these were omitted from analyses.

The second main source of data was subjects' responses to questions (Appendix 3) and their comments. These were transcribed from tape recordings made during the sessions.

Secondary data was obtained in various ways. Sessions were replayed either manually or automatically, using a program which read users' keystrokes from logs and passed them as input to the search program. This enabled experimenters to see exactly what the original subject had seen. One of the experimenters (Stephen Walker) made assessments of the quality of the lists of records displayed, as well as observations on user behaviour with regard to use of commands, time spent reading screen displays etc. A similar method was used to compile complete lists of all records chosen for each topic, together with the subject(s) who had chosen them and the source of the record (original search, query expansion or class browsing). The primary purpose of these lists was to produce printouts of records for external assessment for relevance (4.9), but they were also used to generate lists of the Dewey numbers at which all the chosen records were classified.

Note on statistical tests

Measures of system performance such as numbers of records chosen or seen are all more or less skewed, and it may not be safe to use tests which depend on an assumption of normality. In testing for performance differences between systems we used Mann-Whitney U tests, which depend only on the rank-order of the observations. For testing differences such as the proportion of relevant records obtained under different conditions, chi-squared tests were used.
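
To show why a rank-based test is indifferent to skew, the following minimal sketch computes the Mann-Whitney U statistic from pooled ranks. It is an illustration only, not the program used for the analysis; the function name is ours and the two samples are invented figures, not data from the experiment.

```python
def mann_whitney_u(xs, ys):
    """Return (U_x, U_y) for two samples, giving tied values their mean rank."""
    # Pool the observations, remembering which sample each came from (0 = xs, 1 = ys).
    pooled = sorted((v, g) for g, sample in enumerate((xs, ys)) for v in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n_x, n_y = len(xs), len(ys)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    return u_x, n_x * n_y - u_x


# Invented example: records chosen per topic-session on two systems.
print(mann_whitney_u([3, 5, 8, 12], [1, 2, 4, 6, 7]))  # (15.0, 5.0)
```

Replacing the value 12 by 1200 leaves the result unchanged, since only the rank order of the observations enters the calculation.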

5.2 General comparison of the systems

5.2.1 Statistics on efficiency and effectiveness

Some figures obtained from computer analysis of the transaction logs are given in Table 5.1. Several measures of efficiency are possible. The ratio of the number of records chosen as relevant to the number of brief records seen is a precision-type measure (row 4 of Table 5.1). Here the Q and D systems were both significantly better (P > .99) than the F system. Similar results hold for the time spent per record chosen (row 5).

It was thought possible that browsing and expansion facilities might lead to a lower mean number of searches per session. A search on the full (F) system has the potential to lead to the selection of a wider range of records than a search on the query expansion (Q) system (and both F and Q might show a reduction in the number of searches over the dumb (D) system). This was not the case. Row 1 of the table shows that the Q system had the lowest mean number of searches per session, followed by D and F, but the differences are not significant.

With our experiment it was not really possible to consider the question of relative retrieval effectiveness of the systems. If any of the systems had been markedly ineffective this would have shown up, but even the dumb system is presumably at least as good for subject searching as most existing online catalogues. To measure effectiveness it would have been necessary to ask the subjects to try to carry out exhaustive searches, and this would have removed the experimental task even further from a realistic situation. It is true that the mean number of records chosen per topic-session is significantly greater for the Q and the F systems than for the D system (row 3 of Table 5.1), but all subjects used the D system first and a learning effect was to be expected. Any difference between the Q and the F systems is small and not significant at the 5% level.

Overall, users of the F system worked very much harder than users of the other systems without obvious benefit. However, it will appear later (5.6) that they tended to produce lists of chosen records containing a higher proportion of "good" records than those resulting from use of the other systems. The fact that F users looked at far more brief records is almost entirely due to the fact that every time they chose a record they were asked whether they wanted to see "books shelved near this one" (Fig 3.12). Most users chose this option, at least near the start of their session (some soon appear to have become disillusioned), and the classified sequence was not, on the whole, an efficient source of relevant records (Table 5.4).

Table 5.1 Some comparative system use statistics

                                          Q        F        D     all systems

subjects                                 24       27       51        51
topic-sessions                           65       64      108       237
t-sessions with no recs chosen            3        1        8        12
  (omitted from the analysis)

(1) searches/t-session                 1.88     2.25     2.02      2.04
(2) brief recs seen/t-session          62.9    111.3     44.1      67.4
(3) records chosen/t-session            8.3      7.9      5.4       6.9
(4) brief recs seen/record chosen       7.5     14.1      8.2       9.8
(5) time/record chosen (secs)          40.7     56.1     52.5      49.7
(6) time/brief rec seen (secs)          5.4      4.0      6.4       5.1
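
Rows 4 and 6 of Table 5.1 appear to be derivable from the other rows. The sketch below assumes (the text does not say so) that row 4 is row 2 divided by row 3 and that row 6 is row 5 divided by row 4, and checks this against the printed F and D columns. The dictionary keys are our own labels.

```python
# Means as printed in the F and D columns of Table 5.1.
table = {
    "F": {"brief_seen": 111.3, "chosen": 7.9, "secs_per_chosen": 56.1},
    "D": {"brief_seen": 44.1, "chosen": 5.4, "secs_per_chosen": 52.5},
}


def derived_rows(col):
    """Return (row 4, row 6) under the assumption that both are simple ratios."""
    briefs_per_choice = col["brief_seen"] / col["chosen"]         # assumed row 4
    secs_per_brief = col["secs_per_chosen"] / briefs_per_choice   # assumed row 6
    return round(briefs_per_choice, 1), round(secs_per_brief, 1)


for name, col in table.items():
    print(name, derived_rows(col))
# F (14.1, 4.0)
# D (8.2, 6.4)
```

The agreement with the printed values is consistent with the assumption but does not prove it; the original program may equally have averaged per-session ratios.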

5.2.2 User opinions on system ease and helpfulness

After they had used the dumb system followed by either the Q or the F system, subjects were asked which system they had found easier to use and which one they had found more helpful in finding books relevant to the essay titles. The results are summarized in Table 5.2. This shows that two-thirds of the Q subjects felt the second (Q) system to be at least as easy as the dumb system, but more than half the F subjects felt that the dumb system was easier. The hypothesis that Q subjects are more likely than F subjects to find their second system at least as easy as the dumb system is accepted at the 5% level (chi-square = 3.43). On helpfulness the Q and F systems are more evenly balanced, 92% of the Q subjects and 78% of the F subjects preferring the second system. Although it appears that the Q system may be more often found helpful than the F, the difference is not significant.

Table 5.2 Perceived ease and helpfulness relative to the dumb system

                              Q system users (24)   F system users (27)

Ease:
  1st system easier                    8                    16
  no difference                        7                     6
  2nd system easier                    9                     5

Helpfulness:
  1st system more helpful              1                     5
  no difference                        1                     1
  2nd system more helpful             22                    21
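
The chi-square value of 3.43 quoted for ease of use can be reproduced from the ease counts, taking the Q figures as 8, 7 and 9 (which sum to the 24 Q subjects) and the F figures as 16, 6 and 5. The sketch below assumes the Pearson statistic without continuity correction, which is the form that matches the quoted value; the function name is ours.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Rows: Q users, F users.
# Columns: second system at least as easy, dumb system easier.
chi2 = chi_square_2x2(7 + 9, 8, 6 + 5, 16)

# Tail area of the chi-square distribution with 1 degree of freedom.
p_two_sided = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2))  # 3.43
```

The statistic falls just short of the two-sided 5% critical value of 3.84, so acceptance at the 5% level presumably rests on the hypothesis being directional: halving `p_two_sided` brings it below 0.05.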

Reasons given for the Q system being easier than the D system included:

    the "more" option (four subjects)
    you have to think less (two subjects)
    it directs you better (two subjects)
    easier to "home in on topic"
    cuts down the number of references
    learning effect (dumb system used first)

Reasons given for the D system being easier than the Q system:

    fewer options (two subjects)
    easier to keep track of the topic
    faster
    simpler

The F system was judged easier than the D system because of

    the "shelf" option (two subjects)
    the "more" option

The D system was judged easier than the F system because

    it's simpler (four subjects)
    it has fewer commands (four subjects)
    it has less choice (two subjects)
    it asks no questions
    it goes in one direction
    there is less pressure
    it's more flexible
    it's more structured
    it's easier to navigate round
    it's faster
    it's less distracting
    it's more user friendly


All but two of the 24 Q subjects thought that the Q system was more helpful than the D system. Reasons included the following:

    the "more" option (twelve subjects)
    gives a greater scope of books (four subjects)
    homes in on the topic better (two subjects)
    you don't have to think of so many words (two subjects)
    it's easy to follow the topic
    it gives you a second chance to find books

Twenty-one of the 27 F subjects found the F system more helpful than the D system:

    the "more" option (eight subjects)
    the class browsing option (five subjects)
    it had more options (three subjects)
    it found more books (three subjects)
    you see more books
    it did the work for you
    you don't have to go through the entire list

Several fairly clear-cut conclusions follow from the comments on ease and helpfulness. The "more" option was seen as being effective as a way of homing in on a more focused list of books, and as reducing the amount of effort both in scanning records and, to a lesser extent, in thinking of alternative ways of expressing the topic. Class browsing was seen as helpful by an appreciable proportion of the F subjects, but its use involved a good deal of scanning of screens of records. It is worth noting that although F subjects chose more records from class browsing than from the "more" option (see below - Table 5.4), the latter was mentioned by more subjects than the former in response to the question about helpfulness. Two of the F subjects spontaneously remarked that if the "more" option were added to the dumb system this would provide the ideal combination. This combination is, of course, already present in the Q system.

5.2.3 User opinions on system usefulness

Subjects were asked

    About what proportion of the time did you feel that the computer was useful in helping you to find books which were relevant to the essay titles you were given?

Not surprisingly, people found this question difficult to answer. Some were unable to suggest a figure. The numerical results are summarized in Table 5.3. The Q system looks to be useful more of the time than either of the other systems, although the results are barely significant. Many of the subjects made interesting comments about what was happening when the computer was not being useful. These supplement the responses to the questions about problems with the systems, and they are included in the comments discussed in 5.7.

Table 5.3 Proportion of time the computer was useful

proportion of time      Q system    F system    D system
system was useful       users       users       users       totals

50% or less              4 (20%)     9 (36%)    11 (23%)    24 (26%)
51% - 70%                2 (10%)     3 (12%)    14 (29%)    19 (20%)
more than 70%           14 (70%)    13 (52%)    23 (48%)    50 (54%)

                        20          25          48          93

5.3 Use and performance of the query expansion and classification browsing facilities

We have already seen (5.2.2) that a substantial number of both Q and F users felt that the query expansion option rendered these systems more helpful than the dumb system. A significant proportion of F users also mentioned classification browsing in this context. Certainly, both the options were extensively used, despite the fact that the experimental subjects were not specifically urged to try them.

Table 5.4 shows the proportion of records which were chosen from lists retrieved by each of the three access facilities - the original list retrieved with the user's query terms, and lists retrieved by query expansion searches and classification browsing - on each of the three systems. On the Q system query expansion accounts for two-fifths of the records chosen. On the full system query expansion gave one-fifth of the records and class browsing more than two-fifths, so that the original list was a less important source than the expansion options.

Table 5.5 shows the performance of these options in use on the Q and F systems. Performance is classified as "good", "moderate", "bad" or "failure". A failure occurs when choice of the query expansion option led to no records being retrieved. Use of either of the options is good when it led to the user choosing three or more records from the first screen, or at least half the records retrieved if the (query expansion) option retrieved fewer than six records. It is bad if at most one record was chosen and that not from the first screen. Any other case is moderately good. This classification is fairly arbitrary, but is intended to reflect the fact that if query expansion works properly the best records will usually be very near the top of the list. If the user has to look at several screens to find relevant records from either of the options they are not working very well. Indeed, it was quite unusual for a user to look at all the records retrieved by query expansion.
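
The classification rules just described can be written down as a small function. This is a sketch of one reading of the rules, not the code used in the analysis: the function and parameter names are ours, and two points are assumptions, namely that "at least half the records retrieved" refers to the total number chosen, and that a use in which no record at all was chosen counts as bad.

```python
def classify_use(retrieved, chosen_first_screen, chosen_total, is_query_expansion=True):
    """Classify one use of the 'More' or class browsing option."""
    # Failure applies only to query expansion, which can retrieve nothing.
    if is_query_expansion and retrieved == 0:
        return "failure"
    # Good: three or more records chosen from the first screen ...
    if chosen_first_screen >= 3:
        return "good"
    # ... or at least half of a short (fewer than six) query expansion list.
    if is_query_expansion and 0 < retrieved < 6 and 2 * chosen_total >= retrieved:
        return "good"
    # Bad: at most one record chosen, and that not from the first screen.
    if chosen_total == 0 or (chosen_total == 1 and chosen_first_screen == 0):
        return "bad"
    return "moderate"


print(classify_use(retrieved=40, chosen_first_screen=3, chosen_total=5))  # good
print(classify_use(retrieved=4, chosen_first_screen=2, chosen_total=2))   # good
print(classify_use(retrieved=40, chosen_first_screen=0, chosen_total=1))  # bad
print(classify_use(retrieved=40, chosen_first_screen=1, chosen_total=1))  # moderate
print(classify_use(retrieved=0, chosen_first_screen=0, chosen_total=0))   # failure
```

The order of the tests matters: the two "good" conditions are checked before "bad", so a short list from which the single record was chosen is graded good rather than moderate.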

Table 5.4 Breakdown of records seen and chosen by system and source
          (means per topic-session)

Source                  Q system          F system         D system          all systems
                        (65 t-sessions)   (64 t-sess)      (108 t-sess)

Original list
  brief recs seen       33.5 (53%)         30.6 (28%)      44.1 (100%)       37.5
  records chosen         4.8 (58%)          3.0 (38%)       5.4 (100%)        4.6
  brief recs/choice      6.9               10.1             8.2               8.2

'More' option
  brief recs seen       29.5 (47%)         23.9 (21%)      N/A               26.7
  records chosen         3.5 (42%)          1.6 (20%)      N/A                2.6
  briefs/choice          8.4               15.0            N/A               10.4

'Class' option
  brief recs seen       N/A                56.8 (51%)      N/A               56.8
  records chosen        N/A                 3.3 (42%)      N/A                3.3
  briefs/choice         N/A                17.2            N/A               17.2

All sources
  brief recs seen       62.9 (100%)       111.3 (100%)     44.1 (100%)       67.4
  records chosen         8.3 (100%)         7.9 (100%)      5.4 (100%)        6.9
  briefs/choice          7.5               14.1             8.2               9.8

Note. The true figures for the 'More' option on the F system are slightly lower than those shown, and the "original list" figures correspondingly higher. A bug in the search program caused two searches to be affected by misbehaviour of the 'Back' option, with the result that a few records shown as having come from the original list really resulted from query expansion.

Table 5.5 Performance of the query expansion and class browsing options in use

performance of          query expansion               class browsing
option in use           Q system       F system       F system

good                     23 (17%)        8 (6%)        16 (7%)
moderate                 67 (48%)       59 (47%)       77 (36%)
bad                      33 (28%)       53 (42%)      122 (57%)
fail                     10 (7%)         6 (5%)       N/A

total                   133            126            215
per session               2.14           1.97           3.36

5.4 Query expansion in detail

5.4.1 Statistics

On the Q system, the query expansion option was used a total of 133 times, by 20 out of 24 subjects and in 50 out of 65 topic sessions. It led to the retrieval of 42% of the records chosen on the Q system. On the F system it was used 126 times, by 26 of 27 subjects and in 53 out of 64 topic sessions. It led to the retrieval of 20% of the records chosen on the F system (Table 5.4). The extent of use of this option is slightly surprising: although the facility was very briefly demonstrated prior to each session, subjects were not specifically encouraged to use it, and the prompt is only one among several options (Fig 3.8). It may be that it would have a lower take-up in live use of the systems.

The proportion of failures (not retrieving any records) was low: 7% for the Q system and 5% for the F system. The combined figures for good and moderately good were 65% for the Q system and 53% for the F system (Table 5.5).

It is clear that query expansion is a fairly prolific and relatively "easy" source of records perceived as relevant. It was significantly (at 5%) less useful and efficient on the F system than on the Q, both with regard to the number of records chosen and the number of brief records looked at for each one chosen. It appears that F subjects spent less time looking at the results of query expansion, perhaps because they had often already selected records from classification browsing. It is worth noting that only 15% of the query expansion records selected by F users were from the second or subsequent screen of records retrieved, as against 36% for Q users. This suggests that F users were more likely to feel that they had already chosen enough records.

5.4.2 Quality of lists of records from query expansion

The number of records chosen from query expansion by the subjects in the experiment cannot be expected to be a good indicator of the quality of the lists. The number of records already chosen will certainly affect users' behaviour. This is borne out by the fact that query expansion was far less fruitful on the F system than on the Q system (Table 5.5); it seemed unlikely that the lists produced by the option would be noticeably less good on the F system.

Thus an attempt was made to assess the quality of the lists of records retrieved using query expansion by the subjects in the experiment. To do this, all the Q and F searches were repeated by one of the experimenters. He looked at the first screen of brief records retrieved by each query expansion search and graded the screens in accordance with the following scale:

    A: At least four of the records look relevant (at least half, if fewer than eight records were retrieved)
    B: Several of the records are worth looking at
    C: One or two of the records might be worth looking at
    D: It is unlikely that any of the records would be relevant.

"Relevance" was judged from the brief records only. The assessment was nearly always done with respect to the question as given in the appropriate topic sheet, not the user's actual search statement, as search statements do not always indicate what users are "really" searching for, rendering it very difficult to assess relevance. For example, it would be difficult to know what the searcher for "Slump 1932" was looking for without knowing that the question was "How widespread was the Slump by 1932? ..". The question from the topic sheet was used even where the search statement was much broader than the sought topic, on the assumption that the user would have selected records which were relevant to the question. An example of this is the search "Welfare economics" for the question "Would perfectly competitive markets ensure maximization of social welfare?". In a few cases, however, the search statement was comprehensible but seemed rather remote from the sought topic, and here the search statement was used rather than the topic.

Table 5.6 is a summary of the results of this assessment. Thirty query expansion searches are omitted from the table, 16 because they failed to retrieve any records and the remainder because of unidentified program errors which rendered it difficult to repeat a few of the sessions accurately. For comparison, the experimenter's assessments are cross-tabulated with the gradings for performance in use as given in Table 5.5.

Table 5.6 Experimenter's assessment of query expansion searches

Experimenter's                           Performance of query expansion in use
assessment of
query expansion     system     total          good         moderate     bad

A                   Q           54 (44%)      18            29            7
                    F           47 (42%)       3            35            9
                    both       101 (43%)      21            64           16

B                   Q           38 (31%)       5            21           12
                    F           37 (33%)       3            16           18
                    both        75 (32%)       8            37           30

C                   Q           16 (13%)       1             9            6
                    F           15 (14%)       1             4           10
                    both        31 (13%)       2            13           16

D                   Q           16 (13%)       0             4           12
                    F           12 (11%)       0             0           12
                    both        28 (12%)       0             4           24

Totals              Q          124            24 (19%)      63 (51%)     37 (30%)
                    F          111             7 (6%)       55 (50%)     49 (44%)
                    both       235            31 (13%)     118 (50%)     86 (37%)
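
One way to check whether the grades differ between the two systems is a chi-square test on the grade totals of Table 5.6, taken here as Q: 54, 38, 16, 16 and F: 47, 37, 15, 12 (which sum to the stated totals of 124 and 111). The report does not say which test, if any, was applied at this point, so the sketch below is only an illustration; the function name is ours.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for obs, c in zip(row, col_totals)
    )


# Rows: grades A, B, C, D. Columns: Q system, F system.
grades = [[54, 47], [38, 37], [16, 15], [16, 12]]
stat = chi_square(grades)

# Share of all assessed searches graded A or B.
share_a_b = (54 + 47 + 38 + 37) / 235

print(stat < 7.815)         # True: below the 5% critical value for 3 degrees of freedom
print(round(share_a_b, 2))  # 0.75
```

The statistic is far below the critical value, and the A and B grades together account for about three-quarters of the assessed searches.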

Table 5.6 shows, as expected, that there is no detectable difference between the Q and the F systems in the quality of the lists of records retrieved by query expansion searches. More significantly, it shows that three-quarters of the searches fall into the A and B categories. This is an encouraging result, particularly as the option is freely available, and was often used, even when only a single record has been chosen as the source of terms for expansion.

However, this is not enough evidence to conclude that the free availability of query expansion is worth its cost in computational resources. At this stage there are numerous unanswered questions. We do not know how many of the records chosen from query expansion were readily available from a previous list (the original list, a previous query expansion or, for the F system, a class browsing screen). Some of the experimental subjects, particularly some of those who used the Q system, did a query expansion search after almost every choice of a record. In these cases, once more than three or four records have been chosen, the lists retrieved by successive query expansions tend to be almost identical (except that records which have been chosen do not appear in subsequent lists). These repeated searches do not usually result in the loss of potentially relevant records, but nor do they help the user; it would probably be more efficient for the user to select more of the available relevant records before doing another query expansion search. This raises questions about the presentation of the "More" option which are discussed in 6.2.1.

5.4.3 Users' comments on query expansion

Users of both the Q and the F systems were asked the following question after their session: Did you use the 'more' option? Did it help you to find more useful books?

Twenty-two of the 24 Q system subjects said they had used the option. The true figure was 20, and all these said that it had helped them (the other two said that it had not helped). Of the 27 F system subjects all said they had used it (true figure 26); 19 said that it was helpful, 7 were uncertain and 1 said that it had not been helpful. This reflects the smaller proportion of books chosen from query expansion by full system users (Table 5.4).

Subjects were asked if they would like to comment on the facility. Fifteen of the Q and 18 of the F subjects did comment. The comments were mainly appreciative but there was a certain amount of criticism of the way in which search results were presented. Q subjects were on the whole more positive than F subjects. Some of the F subjects may not have clearly distinguished the "More" option from the classification browsing, and they certainly made less extensive use of "More" than the Q subjects; eight of the F subjects used expressions of uncertainty ("I think ..", "It was useful in some ways but ..", "I think it was 50/50.") against only three of the Q subjects.

Nine Q subjects and 11 F subjects reiterated that they had chosen books from the "More" option or that it had been useful. Six of the Q subjects said that they had used the facility "a lot" or "all the time". One subject said that he had chosen one book from there which he thought he would not otherwise have found.

The facility can bring about a shift in the emphasis of a search, not always for the better.

    It seemed to enable you to get more specific books
    It helped bring out other ranges of books
    It was helpful to feed you into new areas
    It seemed at times as if it was getting a bit too specific
    It was useful in some ways but it was still getting away from the search

Two Q subjects said that it was helpful when the initial search had not been very successful.

    It found more books especially on the subjects I didn't know anything about.
    The first section didn't come up (I was looking for the 1932 slump) and when I pressed 'More' for specific books it was much more helpful.

Two subjects thought that it was time-consuming, and one that it was quite tiring enough browsing through the original list.

    It makes it more time-consuming in some ways by breaking it down further

Three users felt it to be a fault that query expansion sometimes retrieves records from the original list which have not been either chosen or rejected.

    I found other books listed in the original category [i.e. original list], so whether it was finding more books or whether I was just wasting time I don't know

This is one of the aspects of the presentation of interactive query expansion systems to which the designers had given considerable attention. It is perhaps noteworthy that more of the subjects did not comment on this repeated retrieval of the same records.

5.5 Classification browsing in detail

5.5.1 Statistics

On the F system, the only system offering this facility, the classification browsing ("books shelved near ...") option was used a total of 215 times, by all the 27 subjects who used the system, and in 63 out of 64 topic sessions. It led to the retrieval of 42% of the 507 records chosen by users of the F system. The extent of use is not at all surprising, since it was offered by means of a yes/no choice every time a record had been chosen as relevant (Fig 3.12).

Invocations of this function were classified in the same way as for the query expansion option (above), except that there are of course no cases of failure to retrieve any records. The combined figures for good and moderately good are 43%, substantially lower than the corresponding figure for query expansion (Table 5.5). In 54% of cases the user chose no records at all.

It is clear that classification browsing is often not useful, yet over 40% of records chosen on the F system came from this option. It was relatively inefficient as a source of records: Table 5.4 shows that a mean of more than 17 records were looked at (in brief) for every one chosen.

It was noticed that the classification seemed to be particularly ineffective in the area of computing, where there were several thousand records broadly classified at eight numbers within 001.54 (electronic data processing). In two-thirds of the 50 invocations of this option in searches for computing topics no records were chosen. The Dewey Classification has since been revised in this area, and PCL records have been reclassified.

5.5.2 Quality of lists of records from classification browsing

In an attempt to find out more about the operation of classification browsing, 38 of the full system sessions were repeated by one of the experimenters, who graded the screens of records which the user had seen displayed following choice of the browsing option in a similar way to that described above for query expansion (5.4). The 38 sessions were all those of the first 17 subjects who used the full system, except for four sessions and two part-sessions which were omitted because, for technical reasons, it was not easy to repeat the searches exactly as they had been performed. Two further part-sessions were omitted because the user's search statement did not seem likely to retrieve records on the sought topic.

The screens were assessed on the basis of the brief titles displayed; no account was taken of the actual definition of the Dewey class marks (nor was any account taken of whether the user actually chose any records). The definitions of the categories were as follows:

    A (good): a reasonable proportion of the records seen appeared to be about the sought topic. Example: 574.875 (cytology - membranes and cell wall) in a search for "Ion transport".

    B (possible): some of the records seen appeared to be not too distantly related to the sought topic. Examples: 658.403 (management decision making and information management) in a search for "Management information system design", 155.422 (child psychology - infants) in a search for "Influence of the mother on the child".

    C (scattered): a few of the records were somewhat related to the sought topic: the records seen covered a wide range of topics. This category usually appeared when there were very broad classification codes. Example: 621.38 (electronic and communication engineering) in a search for "Computer data in telephone network".

    D (remote): none of the records seen (apart from the one used as the pivot) appeared at all closely related to the sought topic. Example: records at 362.17.. (specific medical services) in a search for "Computers in medicine".

These gradings may be compared, cautiously, with the identically lettered ones used in the assessment of screens from query expansion (Table 5.6).

Assessment of the classified displays was not limited to the first screen displayed. This was tried initially, but the experimenter felt that it gave results which would be less likely to reflect the behaviour of users. Query expansion usually gives the best records very near the top of the display list, with similarity decreasing steadily and relatively smoothly. Class browsing displays are far more erratic and unpredictable, and it seemed more reasonable to make assessments based on what users had actually chosen to see.

Table 5.7 summarizes the findings. More detailed results, with topic references, search statements and Dewey numbers, are given in Appendix 8.

Table 5.7 Experimenter's assessment of a sample of the classified displays

Experimenter's assessment       Performance of class browsing in use
of classified displays          total         good         moderate     bad

A (good)                         22 (22%)      7             9            6
B (possible)                     49 (49%)      3            22           24
C (scattered)                    18 (18%)      0             4           14
D (remote)                       12 (12%)      0             0           12

Totals                          101           10 (10%)      35 (35%)     56 (55%)

The experimenter's assessments of the classified displays in Table 5.7 suggest that classification browsing might be more useful in live use of a system than the results during the experiment suggest. While only 22% of uses gave a good proportion of relevant records, another 49% looked as though they might provide a few hits. The reasons for the discrepancy between the experimental use figures and the experimenter's assessments appear to be (1) that the subjects in the experiment were not trying to do exhaustive searches, and (2) that in many cases substantially the same display was seen more than once: subjects chose the option more than once at the same Dewey number, and so were less likely to want to choose any records after the first occasion.

5.5.3 Users' comments on classification browsing

Users of the F system were asked the following question after their session: "Did you opt to look at books shelved near the one you had chosen? Did this help you find more useful books?" All 27 F subjects had used this option. Twenty said it had helped them find useful books, five said it had not been useful and two were uncertain. The logs show that three subjects chose no records from classification browsing and three chose only one record.

Eighteen of the subjects commented on their experiences with the classification option. Despite the favourable reaction reported above, all of the comments were in some degree critical. Subjects drew attention to the fact that the facility was not always useful and that it could be time-consuming or confusing. Some of the following comments were made in answer to the question "Did you have any problems using the computer?", but they are more appropriately given here.

Nine subjects made remarks to the effect that the facility had been sometimes useful and sometimes not.

    It was not as useful as I thought it might be.

    [It was useful] in a small amount of cases - not as much as the 'more' option though.


Eight subjects remarked that books which are close together in the classification sequence are not necessarily about the same subject, or that related books were separated by books on a different subject.

    Just because books were on the same shelf didn't mean they were really relevant.

    [It] usually got you off the track. I had to discipline myself not to waste time looking at everything.

    [I] got confused going through all the different topics.

    There's a difference between looking along bookshelves and being able to see things nearby.

    I put in a keyword and found a book and looked three or four screens up and down from that, and right at the extremes of those [I] found lots of major books. It would have been helpful if I could have skipped about ten books. Rather than looking along a shelf, looking down about five shelves.

One subject said:

    Get rid of that!

5.6 "Objective" precision of chosen records

Perhaps the most surprising result of this experiment is that when the records chosen by the subjects were assessed for relevance by independent assessors, as described in 4.9, the F system gave markedly higher precision than the Q system. The figures are given in Table 5.8, with a breakdown by source of records (original list, "more" option, "class" option). In fact the F system is significantly better in this respect than both the D and the Q systems. The D system also gave higher precision than the Q system, but the difference in this case is less marked. It must be emphasized that the precision which was measured is that of the lists chosen by the subjects, not that of the lists produced by the systems.

It is interesting to speculate about the reasons for the overall higher precision achieved by F users. Records were displayed in exactly the same full and brief formats on all the systems. Once a record had been displayed in full format F users chose about the same proportion as did Q and D users (over all systems the proportion of displayed full records chosen was about 72%, or, putting it the other way round, about 28% of "promising" records were rejected after they had been seen in full). From Table 5.1, row 4, it can be seen that F users retrieved and presumably looked at considerably more brief records than did users of the other two systems - about 14 for each record chosen as against seven or eight on the Q and the D. It is tempting to guess that F subjects rapidly became more discriminating about the choice of records, knowing that an almost unlimited number of screens of brief records were readily available.

Table 5.8 Assessed precision of subjects' choice of records

Number/proportion of relevant records by system and source: rel/total (precision)

source                 Q system           F system           D system           All systems
original list          194/280 (69.3%)    145/182 (79.7%)    371/535 (69.3%)    710/997 (71.2%)
"more" option          124/211 (58.8%)    62/90 (68.9%)      N/A                186/301 (61.8%)
"class" option         N/A                150/203 (73.9%)    N/A                150/203 (73.9%)
totals                 318/491 (64.8%)    357/475 (75.2%)    371/535 (69.3%)    1046/1501 (69.7%)
(omitted, see note)    (51)               (32)               (44)               (127)

Note. Forty choices (29 records) are omitted because of missing or "don't know" relevance assessments. These would probably have little effect on the precision figures. A further 87 choices are omitted because they are duplicates: many subjects chose the same record more than once in different searches on the same topic.
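The totals row of Table 5.8 can be recomputed from the per-source counts. The following is a minimal sketch of that arithmetic only, not the program used in the experiment; the dictionary layout and the function names `precision` and `system_totals` are illustrative assumptions, while the counts are transcribed from the table.

```python
# (relevant, total) counts of chosen records, transcribed from Table 5.8.
# None marks a source that a system does not offer (N/A in the table).
# The data layout and names are illustrative assumptions, not the authors' code.
TABLE_5_8 = {
    "Q": {"original": (194, 280), "more": (124, 211), "class": None},
    "F": {"original": (145, 182), "more": (62, 90), "class": (150, 203)},
    "D": {"original": (371, 535), "more": None, "class": None},
}


def precision(relevant, total):
    """Proportion of chosen records judged relevant, as a percentage."""
    return 100.0 * relevant / total


def system_totals(table):
    """Sum relevant and total counts over the sources each system offers."""
    totals = {}
    for system, sources in table.items():
        cells = [cell for cell in sources.values() if cell is not None]
        totals[system] = (sum(r for r, _ in cells), sum(t for _, t in cells))
    return totals


totals = system_totals(TABLE_5_8)
for system, (rel, tot) in totals.items():
    print(f"{system}: {rel}/{tot} ({precision(rel, tot):.1f}%)")

# Pool all three systems to get the overall figure.
all_rel = sum(rel for rel, _ in totals.values())
all_tot = sum(tot for _, tot in totals.values())
print(f"All: {all_rel}/{all_tot} ({precision(all_rel, all_tot):.1f}%)")
```

Because precision here is computed over the records the subjects chose, pooling counts across sources weights each source by the number of choices made from it, which is why the F total lies between its per-source figures.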

5.7 User comments about the systems

In response to the questions about "problems with the computer" and suggested improvements (Appendix 3) some of the experimental subjects made comments which should be of interest to retrieval system researchers and designers, although by no means all the comments bear on matters relating to query expansion. Many comments were also given in explanation of what was happening when the computer was not being useful (5.2.3). There were a considerable number of mildly critical comments, but the interviewer sought these. She did not seek compliments. The finding that most of the subjects were readily able to make adequate use of any of the systems is a gratifying result, although the query expansion system appeared to offer the best combination of ease of use with effectiveness.

5.7.1 Choice of search terms and retrieval of non-relevant records

There were more than 20 comments about the necessity of weeding out non-relevant records ("Looking through the books", "Computer bringing up unwanted books", "I put in AI and got lots of foreign books"). Most subjects seemed to accept that this was necessary, but there were half a dozen complaints about the systems retrieving records which had little relation to the sought topic. Some of these were about false drops due to false coordination or homography and others were about records being retrieved under just some of the user's search terms. Some users feel that the system has, or at least ought to have, a linguistic knowledge that extends to the recognition of noun phrases which describe a topic, even if it is unable to find any records in response; others think that the onus is on them to find the right way of describing the sought subject.


    It's really defining what you want - I put 'social welfare' into the search and it came up with 'social work' which wasn't the same thing.

    Sometimes you put something in and it comes up with something totally absurd - it's actually the dictionary definition of what you put in rather than the concept.

Two users expressed the wish to see records under each of their search terms separately. One expressed doubts about the feasibility of this:

    I would have liked to have been able to have separated out the words that I selected so that instead of it going for a combination of the words I used and that was it, I would have been able to tell it that I wanted to look at all the economics books. But then I'd probably have about 300 books to look at.

There were actually about 6000 books indexed under economics.

There were 16 comments about the difficulty of thinking how to express the topic ("Thinking of words", "Can't define what you want", "Picking the right phrase"). A few users thought that the system ought to be able to help them in the choice of terms.

    I wasn't specific enough in requesting precisely the thing I wanted so I seemed to get more general titles to scan through.

    Words should be made more flexible - tell you other words that mean the same thing. For example, another word to use instead of 'design'.

5.7.2 Record content and display; the recognition of relevance

There were more than 20 comments about the difficulty of assessing relevance on the basis of the displayed information. Most of these are connected with the lack of subject information in the source bibliographic records, and there is not much that can be done without a new approach to cataloguing. About 80% of the records in the database, and nearly all the records retrieved by the subjects, had LCSH or PRECIS headings or both. Several subjects suggested that additional information should be presented on request "at a lower level" [than the "full" record].

    Once it's selected a book and you've looked at it you could have more details of what the book's about.

    It gave who wrote it and the publisher but it would be more useful to actually have a general idea of what the book's about.

Three subjects asked for full non-abbreviated titles in the brief display:

    It would be better if the titles weren't abbreviated - they should run onto the next line.

    Where there are lists of reports from seminars perhaps they could be under one heading and then you go further in but maybe that's too complicated for the computer.

Only one subject complained about the constant switching between brief and full display, although it is likely that an appreciable proportion of users would be conscious of this if they were using the systems frequently, or if they were using a terminal connected over a slow network.

    Is it possible to save time when selecting a book? There were some I could say I wanted just from the title - I didn't need to see full details. The title either told me that I knew the book or gave me sufficient information or the author or year did. It would have saved time to be able to select it from the first screen.

No subject mentioned that a large proportion of books could be rejected without seeing a full display. This seems to be because rejecting records was not seen as a significant part of the process of conducting a search; choosing a record adds it to the list which will be printed, but it is not clear to users that if unwanted records are not rejected they may appear again in a list resulting from query expansion. Five or six of the subjects complained that the "More" option often included records which they had seen before. It seems clear that records not chosen are often regarded as rejected, and the system should have some way of taking account of this.

5.7.3 Problems connected with the list of chosen records

In the evaluation experiment subjects were asked to use the systems to compile a printed list of references (4.8.3). This was the carrot to motivate subjects to choose references as relevant, and it is unlikely that a similar procedure would work in normal, live use of a retrieval system. It had been anticipated that there would be interactional problems connected with these lists of chosen records. As expected, there were two sources of difficulty. It was not possible to edit the list by including a previously rejected record or by removing a record. There were several sessions where the subject repeated a search to obtain the desired printed list. Nor was there any facility for retaining the same list over a number of related searches. There were ten comments about editing the list and seven about carrying it over from one search to the next.

    If I say I don't want it to remember a book it doesn't mean that I want it to forget it. I couldn't go back. I might make a mistake pressing 'no'.

    Let me keep adding to the list I've got or delete things from the list. I'm stuck with my original list unless I start again. I can't do things incrementally.

    When I changed to a new topic I didn't really look that hard to see if there was a way I could go back and forth between my sections or not?

    Sometimes you type a topic in and there's nothing. The first question I was doing, it was very difficult to find books that I really wanted [propaganda and art]. There's so much about world war two and before that - that's interesting but I wanted to find some more relevant books so I had to type in some new topics and I couldn't refer back to the old list I already compiled. But I would have typed it out and had it next to me...

5.7.4 General interaction and presentation

Six full system users made comments to the effect that the system had too many options or that they did not know where they were or what was happening.

    I think you need fewer options actually. It's more confusing than the last one.

    I couldn't remember which one I was on - whether I was on the books next to it or on the subject search or which.


    I did get a little confused between 'more' and the shelf option. There were too many books to look at.

One or two users suggested showing more options at a time rather than fewer so that it was not necessary to go through so many stages. There were few complaints about the umbrella "Restart" option (Fig 3.14), but the logs and other evidence show that some users had problems finding out how to finish, and the difference between "New", "Edit" and "Quit". This did not seem seriously to affect their searches. Many subjects did not use the "View" option, and one or two were plainly unaware of it because they suggested such an option. Only one subject suggested that there should be online help or additional instructions. Several suggested the use of colour either to highlight prompts or to distinguish between the different lists of records. One subject suggested using a mouse instead of mnemonic commands.

    Display: clarity of screens - the way it's arranged; colour screens; different typeface - direct the user to different parts of the screen; icons; use a mouse - not many people can type.

    Some kind of summarised key to the instructions.

    Is it possible to have more colour on the screen? I know you underline the key letters but it would be nice if they could be in a different colour or a different print.

One subject (QE system), an MA student in Manpower Studies, might be a useful design consultant. In response to the question about system improvements:

    What I did like was when I spelt 'german' wrong the computer prompted me to have another go at that word whereas in the other system [LIBERTAS at PCL] if you get it wrong it's all wrong and you have to start from scratch. I like [the input box] - it focuses your eyes.

    If there are any shortcuts which can speed up the process when you make an error so you can switch out and come back in again still keeping all the material you've used so far. I didn't really spot anywhere where you could speed it up in that sense.

    The instructions - it's difficult when you've only got a single colour. If you've got colour you can bang them out at people. I noticed instructions were underlined and highlighted but it seemed a little bit packed to me - perhaps if they were spaced out a bit more you could discern them more quickly.

    Another minor improvement, although it's difficult if you're using windows, is getting instructions in the same place. When you're using these systems you're using books and notes too so your eyes are going away from the screen. If everything's in the same place on each manoeuvre then you're saving a hell of a lot more time. I'm not sure if it's possible using windows.

5.8 How people searched the systems

No interactive system can be assessed without seeing in detail how people really use it. Full transcripts of sessions are extremely long. Appendix 8 contains an illustrated summary of one search and a commentary on another.
