REPORT ON THE NEED FOR AND PROVISION OF AN 'IDEAL' INFORMATION RETRIEVAL TEST COLLECTION

K. Sparck-Jones
C J Van Rijsbergen

University Computer Laboratory
Cambridge

December 1975

Preface

The preparation of this Report was supported by a grant from the British Library Research and Development Department. The Report was used as a discussion document for a Workshop held on December 11-12, 1975. The authors are very grateful to the Workshop participants for their stimulating reactions to the Report, and for their favourable response to the suggestion that an 'ideal' test collection of the kind indicated should be constructed as a material aid to retrieval research over a wide area. We have not attempted to incorporate the many comments made into the final version of the Report, as this would have effectively required a wholly new document. We are instead issuing the Report as prepared for the Workshop with only minor clerical corrections, in the hope of receiving further comments from other potential users of the 'ideal' collection, which could be input to a more detailed design study for the collection construction.

K.S.J.
C.v.R.
December 1975

0. SUMMARY OF REPORT, AND RECOMMENDATIONS

Summary

This study
a) investigates the need for an ideal test collection(s) for information retrieval research;
b) discusses the requirements it should meet;
c) outlines the characteristics it should have; and
d) considers the administrative implications of setting it up and maintaining it.

The study is in three parts:
1. deals with the need for, and properties of, the ideal collection(s);
2. deals with organisational aspects;
3. deals with (roughly) estimated costs.
The Appendices provide details of existing collections.

Our conclusion is that there is a genuine need for a well-designed multipurpose test collection.
The least collection satisfying these needs would consist of a large document set with core properties, having several small, enriched, subsets, and a number of associated collections comparable with the subsets in size and having other properties. Higher-grade ideal collections would provide more alternatives, large and small. At least some of the basic material could probably be obtained from existing services or projects, but this would certainly have to be supplemented.

The ideal collection(s) could be set up by a one-off project, but it must be maintained and made available to research workers, and some person or organisation is required to do this. The collection itself should hopefully allow a large range of uses, and while the primary intention is that it should be made available to different projects, it could also benefit research through in-house exploitation by the holding organisation.

Very rough cost estimates suggest that the collection could be set up for between £25K and £

10000 documents are needed for some purposes

re requests:
< 75 requests are of no real value
250 requests are minimally acceptable
> 1000 requests are needed for some purposes.
reasons: real collections are large
statistically significant results are desirable
scaling up must be studied
(Note that request and document set sizes are not necessarily correlated).

2.1 various in content
documents and requests should cover a range of subjects of varying content and 'hardness' e.g. science, social science, news
reasons: real collections are heterogeneous
consistency of devices must be tested by comparison

2.2 homogeneous in content
documents and requests should cover one subject intensively
reasons: real collections are homogeneous
discrimination of devices must be tested by exhaustion

2.3 various in type
documents should be of different types e.g. popular, specialised, survey, review, patent; requests e.g.
broad, narrow
reasons: similar to 2.1

2.4 similar in type
documents and requests should be of the same type
reasons: similar to 2.2

2.5 various in source
documents should cover a range of journals and journal types
reasons: similar to 2.1

2.6 homogeneous in source
documents should cover one or a few similar journal types in depth
reasons: similar to 2.2

2.7 various in origin
documents should represent different author origins and status; requests should represent different users and needs (link relevance)
reasons: similar to 2.1

Collection specifications

A. To satisfy general requirements for sets of documents, requests and relevance judgements

a) substantive
There should be at least 2 large collections for an subject area respectively, science These should each cover variations in content, type, source, origin, time and language. They should preferably be taken from an operational system, i.e. both documents and requests should be those taken, and accompanied by genuine user relevance judgements.

It should be possible to extract from each of the large collections one or more small subcollections which are homogeneous with respect to content, type, source, origin, time and language. The subcollections should be operational as far as possible, but it is highly probable that some requirements e.g. for alternative indexing etc. can only be met by design.

b) formal
The large collections should be various in formal properties, and it should be possible to extract homogeneous subcollections.

B. To satisfy requirements concerning individual documents, requests and relevance judgements

The detailed specification of core and enriched properties for the ideal collections primarily refers to individual documents, requests, and relevance judgements, rather than sets of documents, etc.
Our choice of properties from the full lists of pp. 11-13 (to which the numbers refer) is as follows:

C = core; E = enriched

Documents

Documents in large collections should be represented by
2 abstracts
3 titles
4 free keywords, from abstracts
6 citations (abbreviations to be avoided)
7 author and bibliographic elements (abbreviations to be avoided)
8 thesaurus or subject indexing, if available
Indexing should be
by one simple indexer, and one expert, for 4
by one expert, for 8.

Documents in small collections should be represented by core plus
4 free keywords, from text, title, to different exhaustivity
5 free sentence
8 thesaurus if not in core, and other controlled languages as available.
Indexing should be
by various simple indexers, and various experts, for 4
by one simple indexer, and one expert, for 5
by various experts, for 9

Requests

Requests in large collections should be represented by
1a) verbal text
b) coordinated terms
c) Boolean formulation (which could consist of a cumulative log of an on-line search)
Indexing should be by one user, and one expert, for (b) and (c).

Requests in small collections should be represented by core plus
1d) terms with weights
e) edited forms of b, c, d
f) modified forms of b, c, d
2 source documents
3 verbal text from source documents.

The total record of any request eliciting procedure should be preserved. For example if a user is asked to mention appropriate known documents, these should be indicated.

Note that while some experiments relevant to on-line searching could exploit requests formulated during previous on-line searches, in other cases new searches would be required; for the latter the ideal collection(s) would provide an adequate set of documents.

Relevance judgements

C Relevance judgements in large collections should allow
2 grades (highly, partially)
one user, and one expert, as judges.
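The core (C) specifications above for documents, requests and relevance judgements can be pictured as three record types. The following is only a minimal sketch under stated assumptions: the class names, field names and identifier fields are hypothetical and do not come from the Report, which does not prescribe any machine format at this point.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Grade(Enum):
    """The two core relevance grades named in the text."""
    PARTIALLY = 1
    HIGHLY = 2


@dataclass
class CoreDocument:
    doc_id: str                      # hypothetical identifier, not in the Report
    title: str                       # property 3
    abstract: str                    # property 2
    free_keywords: List[str] = field(default_factory=list)  # property 4, from abstract
    citations: List[str] = field(default_factory=list)      # property 6
    bibliographic: str = ""          # property 7, author and bibliographic elements
    thesaurus_terms: Optional[List[str]] = None              # property 8, only if available


@dataclass
class CoreRequest:
    request_id: str                  # hypothetical identifier
    verbal_text: str                 # 1a
    coordinated_terms: List[str] = field(default_factory=list)  # 1b
    boolean_formulation: str = ""    # 1c, e.g. a log of an on-line search


@dataclass
class CoreJudgement:
    request_id: str
    doc_id: str
    grade: Grade
    judge: str                       # "user" or "expert"


def relevant_doc_ids(judgements, request_id, minimum=Grade.PARTIALLY):
    """Return the sorted ids of documents judged at or above `minimum` for a request."""
    return sorted({
        j.doc_id
        for j in judgements
        if j.request_id == request_id and j.grade.value >= minimum.value
    })
```

Keeping the judge alongside each grade means that user and expert assessments of the same request can later be compared without re-collecting data.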
It is unlikely that exhaustive relevance judgements could be made for large collections; however some attempt must be made, e.g. by additional searches, to estimate recall.

E Relevance judgements in small collections should allow core plus
more grades, types
various users and various experts.

Collection recommendations

The general problem is that these ideal specifications may have to be tempered by realism. The exact way in which the requirements listed above can be met must be determined to some extent by what is available in operational systems. A specific problem is that while some data may be available in an operational system, it may not be in machine readable form. In general one might hope to extract material with most core properties from an operational system, but keying of items like abstracts must be allowed for. Much of the data for enriched collections would have to be specially supplied. Clearly, any project to set up the ideal collection(s) would have to have an initial phase for a detailed study of data sources.

Our specifications suggest the following as useful but realistic collection sizes:

large    10-30000 documents;   500-2000 requests
medium   2™. : ) O .Oo   "
small    500-1000   "          200-500

The main problems are clearly those of satisfying the core requirements for the large document sets which are needed for some purposes; and of ensuring that small collections are experimentally valid while not making them too large for the capacities of independent projects which might contribute them.

Similar recommendations are needed for numbers of collections. In particular, since choices of property requirement can be combined in different ways, it is convenient to distinguish three grades of ideal collection(s) which might be built. They are
1 best,
2 acceptable, and
3 least.
Which grade is achieved is determined largely by the ease with which enriched property descriptions can be supplied by operational systems.
It is likely that many will have to be supplied by design. In detail, the grades are as follows:

1 best
2 large collections, each with 3-5 subsets having substantial property overlaps, of which 3 are designed and 2 are selected subsets.
3-5 other small collections to complement these.

2 acceptable
2 large collections, each with 3 subsets having some overlap, of which 1 is a designed and 2 are selected subsets.
2 other small collections.

3 least
1 large collection, with 5 subsets having some overlap of which 3 are designed and 2 are selected subsets.
2 other small collections.

These subset specifications do not include ones which could be selected by purely clerical operations, e.g. ones representing all the articles from a specified journal or requests with the same number of terms. Some such selections could easily be made initially, others to order.

Even the least collection would be of great value to research, particularly if it was supplemented by collections from other projects, especially if these were of the good quality which might be achieved by 'bulge' funding. In addition, if the collection was primarily extracted from an operational system, it might be encouraged to grow through the operational system. This would clearly be a very satisfactory way of meeting many ideal collection needs.

Collection form

As noted earlier, it is intended that the ideal collection be machine held. This means that the main collection data is machine held, and in a convenient form. Referring to the categorisation of collection levels on p.4, it is clear that some real information, like full document texts, could hardly be keyed; but it should preferably be held in microform. The material collection must, however, be keyed. It must be supplemented by adequate backup information and documentation
a) characterising the content of the collection and how it was set up; and
b) detailing any processing applied to bring it from level 3 to level 4, and its format at level 4.
It is perhaps not reasonable to require that other projects, even when funded by BLRDD, should provide total information about deposited collections. But these collections should meet some minimum standards of content and format, to make them sufficiently comparable to the core-characterised ideal collection(s); and they should be suitably documented. In principle collections set up by projects not funded by BLRDD might be of value to supplement the ideal collection(s); it might of course not be possible to obtain such material in the desired form, but even raw magnetic tapes and primary documentation should be sought.

2. ORGANISATIONAL ASPECTS

There are two questions here:
1) the ideal collection(s) must be set up, by someone whom we will call the Builder;
2) the collection(s) and possibly other generally useful ones must be kept so that they can be made available to research workers, by someone we will call the Curator.

These are distinct activities, so Builder and Curator need not be the same person. The division between their concerns comes with the provision of the ideal collection(s) at formatted level 4; this could be either the final stage of the building project, or the first stage of the curating one.

We emphasise that there is little point in setting up the ideal collection(s) unless proper management and maintenance is provided for. Some organisation is required even to provide tape copies of the level 4 formatted collection. But we believe that the Curator could have the more positive function of stimulating research through the use of the collection(s).

It will be clear that both setting up and maintaining the collection(s) are non-negligible enterprises. The implications of the specifications outlined in the previous section, and possible ways of setting up and maintaining the collection(s) are discussed below. We necessarily assume that funding sufficient for the least ideal collection(s), is available.
The higher grade collection(s) of p.19 are clearly more attractive to the research community, but we should not claim that they are necessary for the well-being of the community. In particular, since the cost of setting up collections involving different primary document sets must be largely additive, the grade to be chosen depends primarily on BLRDD's willingness to provide funds. We think that a very good case can be made for BLRDD's supplying the least collection, both to reduce the cost of individual projects and to promote the research that information retrieval needs; and since managing this collection and supplying it to users is a not wholly trivial task, some commitment to the future maintenance of the collection from BLRDD is also required.

The Builder

We do not think that this is the place for nitty-gritty recommendations as to exactly how the ideal collection(s) are to be set up. This will depend in part on the level of funding, and in part on how far suitable input material exists in current operational systems or has been assembled by research or development projects. However we feel that the general approach to setting up the ideal collection(s) is independent of such specific considerations. The important points are as follows.

Even if ideal collection building is funded only to achieve the least output, the degree of control required, and effort involved, are considerable.
2.8 homogeneous in origin
documents and requests should represent one kind of author and user
reasons: similar to 2.2

2.9 range over time
documents should be of different dates; requests should be of different dates both for different users and the same user
reasons: similar to 2.1

2.10 coincide in time
documents and requests should be contemporaneous
reasons: similar to 2.2

2.11 various in natural language
documents should be in different languages (or at least their titles should, in which case translations should be provided)
reasons: similar to 2.1

2.12 homogeneous in natural language
documents should be in one language
reasons: similar to 2.2

Globally, it should be possible to use the ideal collection(s) to investigate or simulate
retrospective searching i.e. one request against all documents;
SDI searching i.e. a repeated request against successive document sets;
iterative searching i.e. a modified request against some or all documents;
multifarious searching i.e. a request, modified request or set of requests against multiple document sets.

It should also be possible to use the collection(s) in studying the interfaces between components in a mixed system incorporating, for example, data retrieval, fact retrieval, document retrieval and computer-aided instruction. This may be called hybrid searching.

ii) formal requirements of document, request and relevance judgement sets.

2.1 documents and requests should be variable in
real length
material length (i.e. index source length)
index length
reason: to test consistency

2.2 documents and requests should be homogeneous in
real length
material length
index length
reason: to test discrimination

It is assumed that appropriate parallel substantive and formal properties of relevance judgements will follow naturally if the above specifications for document and request sets are met.

General requirements re individual documents, requests and relevance judgements.
i) substantive, re documents

Document representation

It should be possible to use or study
1. full text (this should be preserved even if not keyed to allow for future new indexing, linguistic studies and question-answering experiments)
2. abstract
   a) as is
   b) all non-stop keywords, stemmed
3. title
   a) as is
   b) all non-stop keywords, stemmed
4. free extracted keyword or keyword string indexing
   1) from full text
   2) from abstract
   3) from title
   each
   a) as words
   b) stemmed
   where in general if
   1 has exhaustivity x
   2 has exhaustivity x and also y>x
   3 has exhaustivity x and also y and also z>y
5. free quasi-extracted sentence, i.e. a single unit sentence incorporating extracted keywords
6. citations
   a) in full detail
   b) in short code
7. author and other standard bibliographic details
8. controlled indexing, including broad subject codes
   1) using any standard existing thesaurus for the field (and/or classification, as many as readily to hand)
      1 from abstract
      2 from title
9. probabilistic indexing (using keywords)
10. usage statistics

Document indexing

Indexing should be carried out
re 4 a) by a simple indexer; by one indexer at different times; by different indexers
     b) by an expert
     (also perhaps re 5)
re 8 by an expert; by expert consensus

Request representation

It should be possible to use or study
1 verbal (given the same source text request)
  a) running text
  b) simple coordination formulation
  c) full Boolean formulation
  d) terms with user weights
  e) edited forms of the above, after consultation i.e. pre search, with librarian
  f) modified forms of the above at end search, with recorded history of subsearches, changes etc.
     i) off-line
     ii) on-line
2 source document as request
3 verbal (as above) from source document
  a) where source document is relevant
  b) where source document indicates area of interest but is not necessarily specifically relevant

Request indexing

Indexing should be carried out
1 by user; by user at different times
2 by expert; by expert at different times; by different experts, by expert consensus
This indexing may be done with a specific relevance need in mind; if so, this should be indicated with the query. Any other germane background information should be recorded.

Index language

Ensure available, i.e. preserved, even if not used, if relevant language exists at time collection(s) is set up.
1 thesaurus
2 classification
3 switching language

Relevance judgements

Ideally these should be exhaustive. But if not some attempt should be made to carry out independent searches using any available information and device, to obtain a pooled output for more broadly based relevance judgements than may be obtained only with simple user evaluation of standard search output. In this case some estimate of the recall sample should be attempted.

It should be possible to separate
1) grades e.g. highly, fairly
2) types e.g. novel, stimulating
of relevance judgement.

Judging should be done by
1 one user; one user at different times; one user specifically sequentially
2 one expert; one expert at different times; several experts; expert consensus

Exclusions

The following do not seem to be called for:
1 books as documents
2 'non-literary' items for documents e.g. technical record specifications, simple data records (e.g. stock, personnel)
3 verification-type requests e.g. for publication dates
4 material in esoteric character sets
5 legal data

Other collections

The provision of a new test collection(s), even if ideal, will not make existing collections redundant.
This is in part because a good deal is known about some existing collections, so they may be useful test beds for new ideas. Some may also be of value for making comparisons with the ideal collection. It must also be recognised that the ideal collection(s) is unlikely to meet every research need, and that future collections associated with specific projects may be created. Thus in the future we should allow for
a) some further comparison between existing collections;
b) some comparisons between existing and new collections;
c) some comparisons between existing collections and the ideal collection; and
d) comparisons between new collections and the ideal collection.

These projections imply that steps should be taken to relate new collections, in particular, to the ideal collection(s). They should be regarded as a means of extending the ideal collection(s).

The ideal collection(s)

It is obvious that the listed requirements for the ideal collection(s) are considerable. In some sense they cannot be provided within a single collection, unless this is no more than a mere aggregate. The following pairs of collection requirements are particularly important:
1. The need for sub and super collections;
2. The need for one and several collections;
3. The need for operational and designed collections.

Thus experimental needs are in fact for different collections which can be related to one another, and which have specific properties. Realism suggests that it may be impractical to seek to ensure that each such collection has the maximum set of (compatible) properties (e.g. all variations on the relevance judgement theme), and further that it is unlikely that such collections with all the requisite properties can just be pulled out of operational retrieval systems.
It appears more practical to think in terms of large, not necessarily completely characterised collections, with richer small subsets, selected as far as possible from operational systems, but supplemented where necessary by deliberately designed information (e.g. further sets of relevance judgements, index descriptions etc.). The former have 'core' properties while the latter are 'enriched'.

This suggests something like the following will turn out to be needed:

all large, medium and small collections respectively to be of comparable size in numbers of documents and requests.

The following sections work this scheme out in detail.

Core and enriched forms of collections

'Core' refers to essential properties possessed by all ideal collections and subcollections; 'enriched' refers to additional properties.

Some core property requirements are readily satisfied even for large collections: the problem is to specify a set of core requirements which are both useful for retrieval experimentation and realistic for large collections.

Some enriched property requirements are very exigent: it is perhaps unrealistic to suppose that all compatible ones can be satisfied for every subcollection; on the other hand it would be nice if different subcollections of a large collection had more in common than their all being subsets of the same set, with core properties. If possible, some overlap in enriched properties should be provided, to allow for valid comparisons and extrapolations.

This suggests
an experienced head
a project of 1½ - 2 years
a cost of £25-30K (ball-park figure).

We see the project as having four phases:

1) Design study, to be carried out by the Builder as an initial short investigation. This would survey existing operational or experimental services, and also test collections, to see how they might be exploited to provide input; and it would discuss the mechanisms for collecting the detailed data, with specific cost estimates.
2) Data assembly
This would involve the extraction and bringing together of data from services, and the provision of new data, e.g. alternative indexing, relevance judgements, etc.

3) Machine input
This would include keying the raw input material and applying any appropriate basic transformations to material already in machine readable form. The boring but non-trivial job of raising this level 3 material to formatted level 4 would be done either by Builder or Curator, according to the resources available.

4) Documentation
This would cover a full account of the source material and the way it was collected, with notes on the keying conventions. Level 4 processing if done would require documentation to match.

The most important requirement of the Builder is that he should be experienced in setting up and using test collections. It would clearly be ideal if Builder and Curator were one, but this is perhaps too much to hope for. If they are not, it is most important that there should be adequate liaison between Builder and Curator, perhaps in phase 2, and certainly in phases 3 and 4. A suitable mechanism might be to have the Curator as a consultant on the building project.

Since bringing the collection up to level 4 could be done under the maintenance project, it is not necessary that the Builder have direct access to powerful computing facilities. Keying of raw data could be done by a bureau, and basic transformations of material selected from machine-based systems could be done either by the supplier under contract, or a bureau, or by the Curator.

Since willingness to do the job, and the necessary experience, are more important than anything else, we do not feel obliged to specify the Builder's locale. He could be
an independent research worker;
a member of an existing retrieval service organisation;
a member of a consulting establishment like Aslib;
on the staff of, or associated with, BL.
The Curator

As mentioned above, a minimal view of the Curator's activities would imply that he did no more than hold the established ideal collection(s) and distribute magnetic tapes and descriptive documentation. However other activities for the Curator are implied by the suggestion in Section 1 that the ideal collection(s) might be supplemented by other project collections. The Curator's brief could therefore include the following:
1. Maintaining and distributing the ideal collection(s).
2a. Obtaining existing reasonably solid test collections, if necessary vamping up at level 3 and processing at level 4.
 b. Acquiring new collections from individual projects, particularly if BLRDD requires or encourages deposition; if necessary vamping and processing.
3. Carrying out (documented) benchmark retrieval runs; gathering basic, e.g. statistical information about collections.

In terms of day to day operations these activities would imply:
1. holding, over a long period;
2. obtaining, and vamping/formatting;
3. clerical processing e.g. of magnetic tapes;
4. providing and distributing documentation and advice;
5. carrying out simple experiments and counting.

Effective curating over this range of activities would require a Curator experienced in both retrieval work and computing, and fairly powerful machine facilities, and would depend on relatively long-term support at the appropriate level. But it must be emphasised that the ideal collection would probably have a long life, so a long-term commitment of funds, even if maintenance is only on a low level, is needed.

Again, it is not for us to recommend a specific organisational setup for the Curator. The following are alternative possibilities for BL:
1. entering into a non-personal contract with a computing service (commercial or university), for the provision of tape copies, etc; or a similar contract with a retrieval service;
2.
establishing a personal Curatorship attached to a Library School, Computing Department or Retrieval Service;
3. establishing a curating project with specified Curator, attached as in 2, with the intention that this should act as a focus of research;
4. setting up an institution with Curator, with intention as 3.

The first of these would almost certainly not promote the fullest use of the ideal collection(s), aid the assembly of other collection material, or ensure benchmark testing. The fourth is objectionable as very expensive and liable to be a white elephant. The second and third alternatives seem the best bets. In particular they would promote the use of the ideal collection(s) as a focus of research, and hopefully prevent the mere accumulation of dead material. Assuming a more than minimal tape-copying service, these alternatives would imply something like a half-time Curator and half- or probably full-time programmer of some calibre, with suitable support, i.e. an annual cost of between £5K and £10K = £50K over five years (not including machine time). A deliberate attempt to encourage extended research using the collection(s) in association with curating would imply higher costs.

Advisory panel

Since building and curating the ideal collection(s) are significant projects, we advocate a panel or steering committee with the following functions:
a) advising the Builder and Curator on project operations;
b) maintaining technically acceptable standards of data management and distribution;
c) encouraging collection use by advertisement;
d) vetting proposed uses;
e) ensuring general continuity.

3. COST ESTIMATES

The cost estimates given in the following pages should only be taken as ball-park figures. The difficulties in giving accurate estimates now, are
1. Insufficient data on which to base estimates,
2. Data out of date (1971),
3. For some parts commercial rates will not apply.

The figures used are mainly based on a report by Peter Vickers (1974).
There is an additional difficulty in allowing for inflation. Although in general a 30% increase may be applicable to the costs quoted we have not adjusted them for the simple reason that in some cases the cost (e.g. computer processing) has gone down. Rather than try and estimate the trend of the cost associated with each item we have stuck with 1971 prices. We also give the raw data on which our estimates are based (taken from Vickers, 1974) in Appendix C.

We only give detailed costs of the Building phase of the operation since the costing of the Curating phase depends heavily on what is actually implemented. We do however list some of the major factors determining the cost of the Curating phase. We ignore the cost of housing the projects and the fact that some of the costs may be borne by separate small projects.

Building (all figures in US dollars; halve for pounds)

The reason for giving most of the costs in US dollars is that we wish to maintain comparability with the figures in Appendix C.

a. Documents

Cost of buying a data-base of some 50000 items from an operational system.

Tape with citations                 750
  + descriptors                    1500
  + abstract                       2500
Low-level reformatting             1000

Data preparation at .05 cents/char.

                                          50000 docs.   30000 docs.   10000 docs.
If we have to keyboard per item
  500 chars (e.g. index terms)               12500          7500          2500
  1000 chars (e.g. abstracts)                25000         15000          5000
  2000 chars (e.g. everything)               50000         30000         10000
Proof-reading is about half the
keyboarding cost
  500 chars                                   6250          3750          1250
  1000 chars                                 12500          7500          2500
  2000 chars                                 25000         15000          5000
Equipment is about half the
proof-reading cost
  500 chars                                   3125          1875           625
  1000 chars                                  6250          3750          1250
  2000 chars                                 12500          7500          2500
Total
  500 chars                                  21875         13125          4375
  1000 chars                                 43750         26250          8750
  2000 chars                                 87500         52500         17500

Computer processing of input at
about 0.33 dollars per item                  16666         10000          3333

We estimate some of the costs associated with generating small enriched collections (2000 docs.)
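Before turning to those small-collection figures, the keyboarding table above can be checked mechanically from the three rules stated in the text: 0.05 cents per character for keyboarding, proof-reading at half the keyboarding cost, and equipment at half the proof-reading cost. The sketch below is illustrative only. The function names are hypothetical, and it assumes that "about 0.33 dollars per item" means one third of a dollar, rounded down to whole dollars.

```python
# Assumption: 0.05 cents per character is the same as 2000 characters per dollar.
CHARS_PER_DOLLAR = 2000


def keying_cost(n_docs, chars_per_doc):
    """Return keyboarding, proof-reading, equipment and total cost in dollars."""
    keyboarding = n_docs * chars_per_doc / CHARS_PER_DOLLAR
    proof_reading = keyboarding / 2   # "about half the keyboarding cost"
    equipment = proof_reading / 2     # "about half the proof-reading cost"
    return {
        "keyboarding": keyboarding,
        "proof_reading": proof_reading,
        "equipment": equipment,
        "total": keyboarding + proof_reading + equipment,
    }


def processing_cost(n_docs):
    """Computer processing of input, in whole dollars.

    Assumption: one third of a dollar per item, rounded down.
    """
    return n_docs // 3
```

Calling `keying_cost(50000, 500)` gives a total of 21875 dollars, the figure tabulated for 50000 documents at 500 characters each; the other cells follow by changing the two arguments.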
Cost of indexing        2.5 - 5.00 per item      5000 - 10000
Cost of abstracting     1.5 - 6.5   "    "       3000 - 13000
Cost of acquisition of full text                 ?

b. Requests

We assume that the requests will be collected during a bona fide use of an operational system. Therefore the cost per query will be mainly that charged by the system. One could estimate 5 - 10 dollars for this. Note however that corresponding to every information need we may have to run 5 - 10 formulations to estimate the relevance set.

c. Relevance judgements

In general one will have to assume that by providing a service free of charge to a user he will in return provide relevance assessments. The exhaustive assessments of small subsets will have to be costed separately
e.g. Acquisition of full text   ?
     Mailing                    ?
     Clerical                   ?

d. Cited references

One of the core requirements is that each document should have as part of its representation the references it cites. Unless this comes with the representation extracted from the operational system the cost of obtaining this further information will have to be estimated separately. The most likely source of the cited references are the ISI tapes for the appropriate period.

e. Generating a collection at level 4

At this stage it is not possible to say whether level 4 should be created by the Builder or the Curator. However this decision will mainly affect the apportioning of costs between the two phases. If for the moment we assume that creating at level 4 is done by the Builder then we will have to allow for extra machine time, file storage, documentation costs.

f. Personnel

Builder and Programmer: up to £8000 p.a.
Library and clerical support: £2000 p.a.
Travel (particularly in the early stages of the design study, see p. 21)
Administration (e.g. mailing, xeroxing)

The reason the cost of the Builder and programmer have been lumped together is that to some extent there exists a trade-off between them.
If the Builder is experienced computationally he would not need a very experienced programmer. On the other hand, if the Builder is not acquainted with the computer technology, his programmer will have to be of a higher standard. Also, it may be that if creating at level 4 is left to the Curator, the building project would only require a half-time programmer.

Maintenance and Distribution

To some extent the costs of this operation will depend on the demand for the data. The operation should be costed over 5 years following the building phase. The main cost factors are

  Curator (half-time?)
  Programming support (1/2 programmer?)
  Equipment (e.g. tapes, terminals, etc.)
  File storage
  Computing time
  Travel
  Advertising
  Clerical

If the ideal collection(s) is to be added to over a period of time, then the costs of the building operation will be applicable here pro rata.

Appendices

  A. Details of British test collections
  B. Summary of non-British test collections
  C. Costs table
  D. Standard collection format
  References

A1 CRANFIELD 2

Project name: Factors determining the performance of indexing systems.

Objectives: 'to deal with index language devices ... (with) ... precise measurement of recall and precision ratios'. To carry out a laboratory test, following up and improving on Cranfield 1.

Chief person / Reference: Cleverdon et al, 1966

Size: 221 queries, 1400 documents; several subsets.

Subject: Aeronautics.

Indexing source: Full texts; also abstracts and titles/titles.

Index languages: 3 types, in 30 forms, all applied manually:
  single terms; with synonym grouping; with hierarchical reduction
  simple concepts; with hierarchical reduction
  controlled terms; with related terms.
Abstracts and titles/titles indexed automatically.

Requests: Authors of selected recent papers were asked to state the reason (in the form of a question) for undertaking the research leading to the paper, and to provide other questions related to this research.
Relevance: By authors, for their own cited papers; exhaustively by experts, to obtain additional papers for author vetting. There were four relevance grades.* Relevance judgements were based on full text.

Document collection: Document set consisted of some recent papers, and their cited references, with some others.

Present state of test collection: Queries and single term index descriptions, abstracts and titles, with relevance judgements, available at level 4.

Other users of test collection: Sparck Jones, van Rijsbergen, Salton and SMART Project workers; also Minker, Svenonius. (Some SMART tests with 24 or 155 queries and 424 documents.)

(Principal subset noted under Size: 42 queries, 200 documents.)

* 'n relevance grades' does not include non-relevance as a grade.

A2 INSPEC

Project name: Comparative evaluation of index languages.

Objectives: A comparative assessment of the retrieval performance, in the INSPEC system, of a number of index languages which might be used as the sole or main means of subject manipulation.

Chief person / Reference: Aitchison et al, 1970

Size: 97 queries, 542 documents.

Subject: Physics, electrotechnology, and control.

Indexing source: Abstracts and titles/titles.

Index languages:
  1. Titles
  2. Abstracts and titles
     (1 and 2 are not normally regarded as index languages)
  3. Printed subject index to Science Abstracts + free language modifier line
  4. Controlled language using a thesaurus
  5. Free language terms (applied by the SDI investigation staff to indicate 'subject content of document' before translation into 4).
  3-5 applied manually.

Requests: Questioners were asked to ensure that the questions were within the scope of their SDI profiles ('it will need to be answerable by some of the documents already notified to you by the SDI service'). Only questions with at least one document at the higher level of relevance were used in the evaluation.
Queries were screened by the research team ('if detailed study of the profile showed that its scope had been changed in the course of the four SDI services' the query would be discarded).

Relevance: Each questioner to be sent for assessment only those documents which he had previously assessed as relevant to his profile. Relevance assessments made by the user on the basis of document texts. There were two relevance grades.

Document collection: 2/3 of documents relevant to some query.

Preconditions:
  1. SDI investigation was in progress.
  2. Queries were solicited from users who received all four services and had assessed at least 12 documents as relevant to their profile.

Present state of test collection: Queries and free language index descriptions, with relevance judgements, available at level 4.

A3 ISILT

Project name: Information science index languages test.

Objectives: 'to compare the effectiveness and efficiency of different index languages as used in subject retrieval systems'.

Chief person / Reference: Keen et al, 1972

Size: 63 queries, 800 documents.

Subject: Documentation.

Indexing source: Abstracts and titles/full texts.

Index languages: 5 kinds, all applied manually:
  1. Compressed term language - 300 terms from ASLIB + related terms added.
  2. Uncontrolled - natural language text words underlying the hierarchical index terms of 3. Specific indexing was followed by redundant indexing.
  3. Hierarchically structured language - post-coordinate.
  4. Same as 3 but pre-coordinate.
  5. Relational indexing.
Exhaustivity and specificity of indexing were controlled.

Requests: These were miscellaneous real requests, formulated considerably later than the dates of the documents.

Relevance: Exhaustive relevance judgements were made, based on abstract and title for 408 of the documents; for the rest full text was used. There was a scale of relevance. The assessment was 'non-user relevance by simple subject experts who were not requesters, indexers, or searchers in the test'.
Document collection:
  Set I: 408 documents from the SMART project. Abstracts from the period 1961-63 were available in machine-readable form. These abstracts are claimed to be bad.
  Set II: 392 good abstracts dated up to 1968; some of these were quoted as known relevant ones by requesters.
2/3 of the collection is in fact relevant to some request.

Present state of test collection: Queries and uncontrolled index descriptions for the whole collection, abstracts and titles for Subset I, with relevance judgements, available at level 4.

Other users of test collection: Sparck Jones, van Rijsbergen, Horsnell. (Some tests with Subsets I or II, with automatic indexing from abstracts and titles/titles for Subset I.)

A4 UKCIS

Project name: Retrieval experiments based on Chemical Abstracts Condensates.

Objectives:
  1. To gain experience using CA-Condensates tapes.
  2. To compare the relative effectiveness of searching titles only, titles-plus-keywords, and titles-plus-digests.
  3. To measure the variation in performance between profiles covering different subject areas.
  4. To investigate automatic profile construction.

Chief person / Reference: Veal / Barker et al, 1974

Size and Subject: 193 requests (subset 48)

  Documents   Size     Subject
  CAC-1       11518    Biochemistry, organic chemistry
  CAC-2       15629    Macromolecular, applied and physical chemistry
  CBAC         1568    Biochemistry
  POST-J       1412    Polymer science
  POST-P       1442    Polymer science

Indexing source: Full texts/titles.

Index languages: Keywords applied manually; titles and digests effectively indexed automatically.

Requests: Formulations were from SDI service users and were current for the documents searched. Different versions were written for different data bases.

Relevance: Assessment of output (pooled if appropriate) by users, usually from titles and digests, sometimes titles only. There were 2 relevance grades, and sometimes 2 non-relevance grades.

Document collection: Documents were taken from Chemical Abstracts Service files.
Present state of test collection: CAC-1 and CAC-2 available at level 3 or 4.

Other users of test collection: Sparck Jones, van Rijsbergen.

A5 MEDUSA

Project name: Medlars on-line search formulation and indexing.

Objectives: To compare the standard method of search formulation by a trained search editor with a physician's using an on-line terminal.

Chief person / Reference: Barraclough / Barber et al., 1972

Size: 58 queries, 51000 documents.

Subject: Medicine.

Indexing source: Full texts.

Index language: MeSH, i.e. controlled language, applied manually.

Requests: Requests from real on-line system users, with two formulations, one by the user and one by a trained search editor.

Relevance: 2 grades of relevance, also 2 grades of non-relevance. Assessment of output based on citation and indexing, by user for both search formulations.

Document collection: Documents taken from monthly files of regular Medlars service.

Present state of test collection: Available at level 3.

Other users of test collection:

A6 NPL

Project name: The National Physical Laboratory experiments in statistical word associations and their use in document indexing and retrieval.

Objectives:
  1. To develop methods of clustering words on the basis of especially computed measures of association between word pairs.
  2. To explore and evaluate ways of employing these clusters and associations to improve performance, especially in the ability to recall relevant material.

Chief person / Reference: Vaswani, 1970

Size: 93 queries, 11571 documents.

Subject: Electronics, computers, physics, and geophysics.

Indexing source: Abstracts and titles.

Index languages: A dictionary of 1000 index terms (stems) was constructed by semi-automatic means, based on a sample of 1648 abstracts. These were used to index the documents.
  1. Weighted / unweighted terms
  2. Clusters derived from associations
  3.
Expansion through a connection network

Requests: 20 people formulated requests based on source abstracts; but these only specified the subject, and the abstract was not necessarily relevant.

Relevance: 17000 relevance decisions were made by the people who formulated the requests. The report claims that 80% of relevant documents were uncovered by various strategies.

Document collection: Set from published abstract journal.

Present state of test collection:

Other users of test collection:

A7 UKAEA/NSA

Project name: SDI from Nuclear Science Abstracts.

Objectives: A study of the relative performance of two computer matching techniques: (a) of Euratom indexing terms and (b) of words in titles.

Chief person / Reference: Olive et al, 1973

Size: 60 queries, 12765 documents.

Subject: Nuclear science.

Indexing source: Abstracts and titles?

Index languages:
  1. Natural language
  2. Euratom index terms
  3. NSA subject categories
  2 and 3 applied manually.

Requests: Formulations were based on SDI service users' interests.

Relevance: User assessment of search output based on title, bibliographic elements and assigned index terms. There were two relevance grades and an option to state that the abstract was required to make a relevance decision.
Document collection: Documents taken from successive issues of NSA used in a regular SDI service.

Present state of test collection: Available at level 3.

Other users of test collection:

B1

Test collections used in SMART Project tests published by Salton and others from 1968 are:

  Collection   Subject            Requests   Documents
  ADI          Documentation          35         82
  IRE          Computing              17        375
                                      24        375
                                      34        780
  Cranfield    Aeronautics            42        200
                                      36        200
                                      22        200
                                      22        424
                                      24        424
                                      30        424
                                     155        424
                                      36       1400
                                      50       1400
                                     225       1400
  Ispra        Documentation          48       1268
                                      48        468, 1095
  Medlars      Medicine               18        273
                                      24        450
                                      29        450
                                      35       1033
               (Ophthalmology)        29        852
  Time         World Affairs          83        425
                                      24        425

These collections are automatically indexed from abstract and title (but ADI from short full texts); some have indexing derived from a manual thesaurus; the Medlars collections' MeSH indexing is not held. The collections are presumably available at something like level 4. Recent tests have mainly exploited the Cranfield 24x424, Medlars 24x450 and Time 24x425 collections. For relevant information see Salton 1975a and b.

[The remainder of Appendix B, a tabular summary of non-British test collections, is not legible in this copy.]
[Appendix C, the costs table, is not legible in this copy.]

D1

Sparck Jones' standard formatted level 4 collections are obtained

  a) by processing the document descriptions and indexing vocabulary automatically to delete stop words and generate stems with associated term numbers; and
  b) by regularising the document, request and relevance judgement sets and deriving basic files from them.

These files all conform to regular layout principles.
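Step a) can be illustrated with a small sketch. The Report does not specify the stop list or the stemming rules, so the stop words, the crude suffix stripping and all names below are assumptions made purely for illustration; only the overall shape (delete stop words, reduce words to stems, give each stem a term number in alphabetical order, hold each description as sorted term numbers) follows the text.

```python
# Illustrative only: stop list and suffix rules are hypothetical,
# not those used for the actual standard collections.
STOP_WORDS = {"the", "of", "and", "in", "a", "for"}
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def content_stems(text):
    """Stems of all non-stop words in a description."""
    return {stem(w) for w in text.lower().split() if w not in STOP_WORDS}


def build_dictionary(texts):
    """Assign term numbers to stems in alphabetical order, starting at 1."""
    stems = set()
    for text in texts:
        stems |= content_stems(text)
    return {s: number for number, s in enumerate(sorted(stems), start=1)}


def index_text(text, dictionary):
    """Represent a description as a sorted list of term numbers."""
    return sorted(dictionary[s] for s in content_stems(text))


texts = ["indexing of documents", "the indexed document and requests"]
dictionary = build_dictionary(texts)
descriptions = [index_text(t, dictionary) for t in texts]
```

Numbering stems in alphabetical order means that a description sorted by term number is also sorted alphabetically by stem, which keeps later merging and inversion steps simple.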
A standard collection consists of files, or streams, as follows*:

  1/0   documents, with original document identifying numbers, and sorted term numbers
  1/1   documents serially numbered, with sorted term numbers
  1/2   requests serially numbered, with sorted term numbers
  1/3   relevance judgements, serially numbered, with sorted original document numbers
  1/4   original document number - serial number equivalence list
  1/5   term dictionary, giving term numbers in serial = alphabetical order and alphabetically first variants in each word group with a common stem
  1/6   term dictionary with words in alphabetical order, if corresponding terms not serial
  1/7   documents, with original numbers, and sorted term names
  1/8   requests, serially numbered, with sorted term names
  1/9   inverted documents, i.e. inverted stream 0
  1/10  inverted requests, i.e. inverted stream 2
  1/11  inverted relevance judgements, i.e. inverted stream 3
  1/12  document frequencies, i.e. a list with the number of terms in each document in stream 0
  1/13  request frequencies, i.e. a list with the number of terms in each request in stream 2
  1/14  relevance judgement frequencies, i.e. a list with the number of relevant documents for each request in stream 3
  1/15  distribution data, giving the number of items, maximum, minimum and average length, and length distribution of streams 1/0, 2 and 3 and 2/0, 1 and 2
  2/0   document term frequencies, i.e. a list with the number of documents for each term in stream 9
  2/1   request term frequencies, i.e. a list with the number of requests for each term in stream 10
  2/2   document relevance frequencies, i.e. a list with the number of requests for each relevant document in stream 11
  2/3   original request number - serial number equivalence list

* conventional numbering with historical rationale

D2

Collection input data processing normally generates a variety of other streams which, since they all conform to the common layout conventions, constitute a natural extension of the basic standard collection.
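As an illustration of how the derived streams of D1 relate to the basic ones, the sketch below builds an inverted file and two frequency lists from a toy document stream. The Report does not give the physical file layout, so the in-memory dictionaries, the sample data and the function names are assumptions for illustration only.

```python
from collections import defaultdict


def invert(stream):
    """Invert a stream: item number -> entries becomes entry -> item numbers."""
    inverted = defaultdict(list)
    for item in sorted(stream):
        for entry in stream[item]:
            inverted[entry].append(item)
    return dict(sorted(inverted.items()))


def frequencies(stream):
    """Number of entries held against each item in a stream."""
    return {item: len(entries) for item, entries in stream.items()}


# Toy analogue of stream 1/0: document number -> sorted term numbers.
documents = {101: [2, 5, 9], 102: [5, 7], 103: [2, 5]}

inverted_documents = invert(documents)                        # cf. stream 1/9
document_frequencies = frequencies(documents)                 # cf. stream 1/12
document_term_frequencies = frequencies(inverted_documents)   # cf. stream 2/0
```

The same two helpers applied to a request stream or a relevance judgement stream would give the analogues of streams 1/10, 1/11, 1/13, 1/14, 2/1 and 2/2, which is one reason a common layout for all streams is convenient.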
The additional streams may include

  an alternative to stream 0, with within-document frequencies of terms indicated;
  a corresponding alternative for requests, with within-request frequencies of terms indicated;
  an alternative to stream 3, with serially numbered documents;
  different sets of relevance judgements, etc.;
  listings of the full dictionary, indicating truncation and grouping;

and so on.

Note that a standard collection refers to a particular set of documents (and requests) indexed in a particular way, i.e. to what may be called a collection version. Thus the Cranfield 1400 documents and requests indexed by manually assigned terms, and by terms automatically extracted from titles, lead to two distinct standard collections. Also, when subsets of documents and requests are selected, all the frequency information is different, so these also generate distinct standard collections. Thus the Cranfield 200 manually indexed document collection is different from the 1400 one.

The particular form of standard collection just given is merely illustrative. It is evident that more complex collections like the ideal one(s), or collections with radically different characteristics, might require more elaborate, or alternative, standard forms. But we feel the principle of standardisation is very important. Data formats, particularly for operational systems, are not necessarily suited to research, so some modification may be needed; but full-blooded standardisation is usually more convenient in the long run. We have certainly found standard collections set up on the lines indicated very helpful. It should also be pointed out that standardisation is a non-trivial operation, so there are clear gains if it is done only once, in a competent way.

References

Aitchison, T.M., Hall, A.M., Lavelle, K.H. and Tracy, J.M. Comparative evaluation of index languages, Part I, Design; Part II, Results, Project INSPEC, Institution of Electrical Engineers, London, 1970

Akiyama, S. Automatic document classification systems, M.Sc.
Thesis, Department of Computer Science, University of Alberta, Canada, 1972

Augustson, J.G. and Minker, J. Deriving term relations for a corpus by graph theoretical clusters, Journal of the American Society for Information Science, 21, 101-111, 1970

Barker, F.H., Veal, D.C. and Wyatt, B.K. Retrieval experiments based on Chemical Abstracts Condensates, Research Report No. 2, UKCIS, University of Nottingham, 1974

Barber, A.S., Barraclough, E.D. and Gray, W.A. MEDLARS on-line search formulation and indexing, Technical Report Series, No. 34, Computing Laboratory, University of Newcastle upon Tyne, 1972

Cagan, C. A highly associative document retrieval system, Journal of the American Society for Information Science, 21, 330-337, 1970

Chan, F.K. Document classification through use of fuzzy relations and determination of significant features, M.Sc. Thesis, Department of Computer Science, University of Alberta, Canada, 1973

Cleverdon, C.W., Mills, J. and Keen, M. Factors determining the performance of indexing systems, Vols 1 and 2, College of Aeronautics, Cranfield, 1966

Feinman, R.D. and Kwok, K.L. Classification of scientific documents by means of self-generated groups employing free language, Journal of the American Society for Information Science, 24, 382-396, 1973

Hansen, I.B. CA condensates as a retrospective search tool. A commentary, Information Storage and Retrieval, 9, 201-205, 1973

Horsnell, V. Intermediate lexicon for information science, School of Librarianship, Polytechnic of North London, 1974

Jacquesson, A. and Schieber, W.D. Term association analysis, Information Storage and Retrieval, 9, 85-94, 1973

Jahoda, G. and Stursa, M.L. A comparison of a keyword from title index with a single access point per document alphabetic subject index, American Documentation, 20, 377-380, 1969

Keen, E.M. and Digger, J.A. Report of an information science index languages test, Aberystwyth College of Librarianship, Wales, 1972

Lancaster, F.W.
Evaluation of the MEDLARS demand search service, National Library of Medicine, Bethesda, Md., 1968

Lancaster, F.W. Evaluating the effectiveness of an on-line natural language retrieval system, Information Storage and Retrieval, 8, 223-245, 1972

Litofsky, B. Utility of automatic classification systems for information storage and retrieval, Ph.D. Thesis, University of Pennsylvania, 1969

Lo, A.K. An automatic optimum iterative feedback document retrieval system, M.Sc. Thesis, Department of Computer Science, University of Alberta, Canada, 1972

Minker, J., Peltola, E. and Wilson, G.A. Document retrieval experiments using cluster analysis, Journal of the American Society for Information Science, 24, 246-260, 1973

O'Connor, J. Text searching retrieval of answer-sentences and other answer-passages, Journal of the American Society for Information Science, 24, 445-460, 1973

Olive, G., Terry, J.E. and Datta, S. Studies to compare retrieval using titles with that using index terms. SDI from 'Nuclear Science Abstracts', Journal of Documentation, 29, 169-191, 1973

van Rijsbergen, C. Further experiments with hierarchic clustering in document retrieval, Information Storage and Retrieval, 10, 1-14, 1974

Salton, G. A theory of indexing, Regional Conference Series in Applied Mathematics No. 18, Society for Industrial and Applied Mathematics, 1975

Salton, G. Dynamic information and library processing, Englewood Cliffs, N.J.: Prentice-Hall, 1975

Sparck Jones, K. Automatic indexing 1974, Computer Laboratory, University of Cambridge, 1974 (OSTI Report 5193)

Svenonius, E. An experiment in index term frequency, Journal of the American Society for Information Science, 23, 109-121, 1972

Tell, B.V. Retrieval efficiency from titles and the cost of indexing, Information Storage and Retrieval, 7, 241-243, 1971

Vaswani, P.K.T. and Cameron, J.B.
The National Physical Laboratory experiments in statistical word associations and their use in document indexing and retrieval, Publication 42, Division of Computer Science, National Physical Laboratory, Teddington, 1970

Vickers, P. The costs of mechanized information systems, Directorate for Scientific Affairs, OECD, 1974

Virgo, J.A. An evaluation of Index Medicus and Medlars, Journal of the American Society for Information Science, 21, 254-263, 1970

Wilson, E. Report on Automatic Indexing Workshop, April 29-30, 1974, Computing Laboratory, University of Kent, 1974 (OSTI Report 5194)

Crouch, D.B. A clustering algorithm for large and dynamic document collections, Southern Methodist University, Dallas, 1972