Appendix 1 : List of reference works consulted

British Library Research and Development Department. Inventory of Bibliographic Data Bases Produced in the U.K. BLR&DD Report No. 5256, British Library, London, 1976.

Hall, J.E. On-Line Information Retrieval 1965-1976: A Bibliography with a Guide to On-Line Data Bases and Systems. Aslib Bibliography No. 8, Aslib, London, 1977.

Leigh, J.A. Guide to Computer-Based Literature Searching Services in Science and Technology available in the U.K. Science Reference Library, British Library, London, 1976.

Thomas, A. (Ed.) London University Central Information Services (LUCIS) Guide to Computer-Based Information Services. 2nd Ed., Central Information Services, University of London, 1977.

Tomberg, A. (Ed.) Data Bases in Europe: A Directory to Machine-Readable Data Bases and Data Banks in Europe. 2nd Ed., Aslib & Eusidic, London, 1976.

Williams, M.E. and Rouse, S.H. Computer Readable Bibliographic Data Bases: A Directory and Data Source. ASIS, Washington D.C., 1976.

Appendix 2 : Sample entry from Williams and Rouse's Data Base Directory

STI

1. BASIC INFORMATION
NAME OF DATA BASE
  ACRONYM/SHORT NAME: STI
  FULL NAME: Specialized Textile Information Service
FREQUENCY OF UPDATE: bimonthly
NUMBER OF TAPES ISSUED PER YEAR: 24
TIME SPAN COVERED BY DATA BASE: 01/70 to present
CORRESPONDENCE WITH PRINTED SOURCE: World Textile Abstracts
FEWER REFERENCES ON TAPE THAN PRINTED SOURCE: yes (see Introduction section 4.2)

2. PRODUCER/DISTRIBUTOR/GENERATOR INFORMATION
PRODUCER OF DATA BASE
  NAME: Shirley Institute, Manchester M20 8RX, England
  PERSON TO CONTACT RE. INFORMATION ABOUT TAPES: Mr. R. J. E. Cumberbirch
  (NOTE: Four research institutions collaborate in covering the literature for STI: British Launderer's Research Association (covers all aspects of laundering and dry cleaning); Hatra (covers all aspects of knitting and making-up); Shirley Institute (covers all fibres other than wool and hair, and their properties and processing other than knitting, including lacemaking, knotting and braiding, and bonding, needling and tufting); Wira (covers wool and hair and their properties and processing other than knitting).)
DISTRIBUTOR OF DATA BASE
  NAME: Shirley Institute, Manchester M20 8RX, England
  PERSON TO CONTACT RE. DISTRIBUTION OF TAPES: Mr. R. J. E. Cumberbirch
GENERATOR OF (PHYSICAL) DATA BASE
  NAME: Shirley Institute, Manchester M20 8RX, England
  PERSON TO CONTACT RE. TAPE FORMAT, SOFTWARE DATA: Dr. K. C. Ellis

3. AVAILABILITY AND CHARGES FOR DATA BASE TAPES
CURRENT FILES: 1975, 24 bimonthly issues
  RESTRICTIONS: ownership of the data base remains vested in the Shirley Institute
  LEASE: $1100.00 base fee plus $260.00 for cost of tapes in the STI Service
BACK FILES: 1970-1974, annual issues
  RESTRICTIONS: ownership of the data base remains vested in the Shirley Institute
  LEASE: $1100.00 base fee per annual issue in the STI Service, plus $250.00 for cost of tapes and air mail postage
SAMPLE TAPES: no charge to bona fide potential subscriber

4. SUBJECT-MATTER AND SCOPE OF DATA ON TAPE
SUBJECT MATTER AND SCOPE: Covers the literature of permanent technical value on the science and technology of textiles plus all relevant UK and US patent literature.
SUBJECT CATEGORY: Chemistry and Chemical Engineering; Patents; Textiles
TARGET USER COMMUNITY: Research and industry
ANTICIPATED GROWTH RATE (AVG. NO. OF SOURCE ITEMS ADDED PER YEAR): 8,000
BIBLIOGRAPHIC DATA BASE SOURCE ITEMS CAN BE APPROXIMATED AS:
  40% Journal articles. Of these, 50% are published in English. No. of journals from which selected articles are entered: 500
  0% Government reports/documents
  40% Patents. Of these, 50% are U.S.A. patents
  0% Monographs, published proceedings, theses, etc.
  0% Preprints, papers presented at conferences
  0% Manufacturers' catalogs
  0% News items from releases, press reports, broadcasts, etc.
  20% Other: manufacturers' technical publications; government reports/documents; preprints; monographs, published proceedings, theses, etc.
  100% Total

5. SUBJECT ANALYSIS/INDEXING DATA
Controlled keywords (from thesaurus). Avg. no. terms/document: 10
Chemical identifiers (nomenclature codes, notations, fragmentation schemes): trade name(s)

6. BIBLIOGRAPHIC DATA BASE ELEMENTS PRESENT ON TAPE
Author(s); author address; editor(s); editor address; corporate author(s); corporate author address; title of item (original lang., translat., translit.); title of source item (journal, conf. proc.); bibliographic reference (volume, issue); page(s), inclusive or total; date (publication date of item, dates for patents); publisher; place of publication; cited references by source item: total no.; patent information; language (of item); indication of type of item (e.g. jnl. art., mono., govt. doc., etc.); treatment code or level of approach (e.g. review, app'n., theory, etc.); item accession number, unique id.
(NOTE: The reference given for patents consists of (1) the patent number, (2) the publication date, and (3) the application date and number in the country issuing the patent, or, if there is a prior date of application (the convention date), the name of the country and the number.)

7. TAPE SPECIFICATIONS
CODE: BCD
CHARACTER SET: upper and lower case
DENSITY (BPI): 556
NUMBER OF TRACKS: 7; 9
LABELS: not present
RECORD FORMAT: fixed, blocked
NUMBER BYTES/BLOCK: 4,096 or 16,384
NUMBER BITS/BYTE: 6

8. SEARCH PROGRAMS

9. DATA BASE SERVICES OFFERED (Brokers not listed.
See Introduction section 4.9)
TRANSLATION SERVICES AVAILABLE FROM: producer
DOCUMENT DELIVERY, REPROGRAPHIC SERVICES AVAILABLE FROM: producer

10. USER AIDS OFFERED BY DATA BASE PRODUCER
VOCABULARY/TERM LIST, THESAURUS: STI Keyterm List. An approved list of keyterms that shows the relationship of each term to other keyterms. AVAILABLE IN: hardcopy; PRICE: available free of charge to data base subscribers; non-subscribers $17.00 for both keyterm lists.
Advisory Lists of Related Keyterms. AVAILABLE IN: hardcopy; PRICE: available free of charge to data base subscribers; non-subscribers $17.00 for both keyterm lists.
DATA BASE TAPE DOCUMENTATION: World Textile Abstracts Service and Specialized Textile Information Service, Manual for Abstracts, January 1975. Describes the coverage, subject indexing, production of tapes, and data base format and data elements. AVAILABLE IN: hardcopy; PRICE: available on request.

Appendix 3 : Example of data base questionnaire as sent out

(The copy reproduced is the questionnaire completed for the Materials data base. Each question carries its Williams code; replies are shown where they are legible in this copy and omitted where they are not.)

SECTION 1 NATURE OF DATA BASE

BASIC INFORMATION
010.0 Name of data base : Materials
030.0 Frequency of update : Biweekly
040.0 Time span covered : Jan '75 to present
045.0 First available in machine-readable form : Jan '75
060.1 If subset data base, name of parent
075.0 Related machine-readable files : None
080.0, 085.1 Corresponding printed compilation
090.1-090.5 Same/fewer/more references on tape than compilation

PRODUCER ETC. INFORMATION
110.0-110.6 Producer organisation and address : Chemical Abstracts Service, The Ohio State University, Columbus OH 43210
130.0-130.6 Person to contact : Marketing Department
150.0-150.6 Distributor organisation, in U.K., and address : United Kingdom Chemical Information Service, The University, University Park, Nottingham
Person to contact : Dr. A. Kabi
Generator (of physical data base) organisation and address : United Kingdom Chemical Information Service, The University, University Park, Nottingham
Person to contact : Dr. A. Kabi

SUBJECT, SCOPE INFORMATION
310.0 Subject matter and scope : Chemical and chemical engineering aspects of the production, properties and applications of industrially important materials.
320.0 Subject category : Chemistry & Chemical Engineering; Mining; Metallurgy.
340.0 Approx. number source items by December 1976
350.0 Average number items added per year
360.1 Percent journal articles
360.11 Percent of these in English
360.12 Number of journals from which all articles taken
360.13 Number of journals from which some articles taken
360.14 Approx. number of journal titles reviewed for input
360.2 Percent government reports, documents
360.3 Percent patents
360.31 Percent of these which are U.K.
360.4 Percent monographs, theses, conference proceedings, etc.
360.5 Percent preprints, conference papers, etc.
360.6 Percent non-government reports, documents
360.7 Percent manufacturers' catalogues
360.8 Percent news items, etc.
360.81 Percent other; description of other
360.9 Percent total (100%)
Percent material not in English
How much per item translated to English
(The numerical replies to 340.0-360.9 are not reliably legible in this copy.)

INDEXING INFORMATION
No special indexing
410.0 Enriched titles
415.0, 415.1 Average number added terms per title : 0
420.0 Uncontrolled (natural language) keywords : Yes (patents only)
420.1 Average number of keywords per document : 2
May these be word strings or only single words : phrases of approx. 4 words
425.0 Controlled (thesaurus) terms; thesaurus name
425.1 Average number of terms per document
430.0 Subject headings; subject heading system name; average number of headings per document
435.0-435.2 Subject codes; subject code system name; average number of codes per document
Descriptive phrase or sentence
Any other indexing
Indexing source
Are chemical identifiers used; are these in a specified record field; average number per document; percent data base having them

BIBLIOGRAPHIC INFORMATION
No bibliographic information
Author(s); author address
Editor(s); editor address
Corporate author(s); corporate author address
Title of item (indicated as original, translation, transliteration)
Title of source
Bibliographic reference (volume, issue)
Pages, specified or total
Publication date
Publisher
Place of publication
References cited by source, in total or details
Standard bibliographic codes, CODEN, ISSN/ISBN, other : CODEN
Abstract; short digest
Patent information
Report number; language
Indication of type of item (e.g. article, monograph, etc.)
Treatment code or level of approach
Item accession or other unique identifying number
Price
(The Yes/No replies in this section are only partly legible in this copy.)

SECTION 2 USE OF DATA BASE

If you run a search service on your data base, please complete Section 2. If you only supply the data to search services run by others, please complete Section 3. (If you both run your own service and supply others, please complete both sections.)

Code
1010.0 Data base only searchable via abstract journal, printed index, etc.
1020.0 Retrospective off-line searching available
1025.0 SDI searching available
1025.1 Time period for SDI
1030.0 On-line searching available
1030.1 All or part of data base available on-line
1040.0 Approx. number of searches per month, altogether
1040.1 Approx. number off-line searches
1040.2 Approx. number SDI searches
1040.3 Approx. number on-line searches
1050.0 Approx. number subscribers altogether
1050.1 Approx. number individual users represented, altogether
1050.2 Approx. number off-line users
1050.3 Approx. number SDI users
1050.4 Approx. number on-line users
1060.0 Indexing fields available for searching
1070.0 Bibliographic fields available for searching
1080.0 Searching by Boolean logic
1080.1 Searching by simple coordination
1080.2 Searching with term weights
1080.3 Arbitrary term truncation
1080.4 Other search methods
1080.5 Search formulation and searching by user or intermediary
1090.0 Person to contact about search service

SECTION 3 SUPPLY OF DATA BASE

1110.0 UK search services to whom data base supplied (name, address, person to contact)
1120.0 Is data base available on Lockheed's DIALOG system

Signed                    Date

Appendix 4 : List of CA and CAB subbases

CA subbases (United Kingdom Chemical Information Service)
CACon : CA CONDENSATES
CBAC : CHEMICAL-BIOLOGICAL ACTIVITIES
CIN : CHEMICAL INDUSTRY NOTES
CT : CHEMICAL TITLES
ECOLOGY AND ENVIRONMENT
ENERGY
FOOD AND AGRICULTURAL CHEMISTRY
MATERIALS
POST : POLYMER SCIENCE AND TECHNOLOGY

CAB subbases (Commonwealth Agricultural Bureaux)
Animal Breeding Abstracts
Apicultural Abstracts
Dairy Science Abstracts
Field Crop Abstracts
Forestry Abstracts
Helminthological Abstracts
Herbage Abstracts
Horticultural Abstracts
Index Veterinarius
Nutrition Abstracts and Reviews
Plant Breeding Abstracts
Review of Applied Entomology
Review of Medical and Veterinary Mycology
Review of Plant Pathology
Soils and Fertilisers
Veterinary Bulletin
Weed Abstracts
World Agricultural Economics and Rural Sociology Abstracts

Appendix 5 : Tabulated data base questionnaire replies
[The tabulation repeats the Appendix 3 questionnaire questions (Sections 1-3, with their Williams codes) as rows, with one column of replies per data base: the CA subbases, the CAB subbases, and other services of Appendix 4. The handwritten entries in these tables are not legible in this copy.]
Appendix 6 : CAB subbase sizes 1976/7

[The table gives, for each CAB subbase listed in Appendix 4, its 1976/7 size and its 1977 increase, both in thousands of records; several subbases are combined on single tapes, marked (1)-(3). The alignment of figures to titles is not recoverable from this copy.]

Appendix 7 : Analysis of relevance judgement requirements

This appendix provides the argument for the number and nature of relevance assessments for the 'ideal' collection. This is initially presented in a very elementary form.
A summary of the assumptions made, and a tabulation of the numbers of assessments required in different circumstances, follow. Some implications of the approach are then discussed. In the last section an alternative presentation in more conventional statistical language is provided.

A. Elementary presentation

The essential object of our calculations is to ensure that adequate relevance information is collected for the evaluation of future experimental results, in the case where exhaustive relevance assessment is impossible. In the past, test data has either been 'globally' exhaustive, in the sense that the entire collection is assessed for the test requests, so that the status of any document retrieved by a new strategy, i.e. indexing or searching device or procedure, is known; or 'locally' exhaustive, in that some or all of the output of the particular strategies being considered is assessed, so that the performance of these strategies can be compared with respect to the combined assessed output for the strategies. The problem encountered in considering relevance assessment for the 'ideal' collection is that while global exhaustion is not possible, local exhaustion as conventionally defined cannot be used for future strategies, since these may produce output not related in a well-defined way to the initial output for which assessments are provided: i.e. the new output is neither included in the assessed output nor overlapped with it in a coherent way. And if an attempt is made to meet this difficulty of local exhaustion by making the initial searches so broad that their output is likely to be exhaustive of future output, this appears to imply that an unacceptably large number of assessments have to be made. The question is therefore whether the initial output can be obtained and assessed, at the time when the 'ideal' collection is set up, in such a way that future experimental output can be properly evaluated.
Essentially, our argument is that under suitable conditions this can be achieved by sampling from the initial output: that is, in the collection building we conduct searches for the given requests (i.e. based on the given need statements), probably a variety of alternative searches for each request, and establish a pool of retrieved documents for each request. From this pool a sample is drawn for assessment. This sample constitutes the set of documents of known relevance status which is used to characterise, and more importantly to compare, performance for new strategies. Our argument has two components: it covers, first, the way in which future experiments are to be conducted, i.e. comparative evaluation is to be carried out; and second, the characteristics of the relevance data needed to support this evaluation methodology.

1. evaluation

The object of a retrieval test, at the lowest level, is taken to be a comparison between two strategies, A and B, representing different choices of indexing, searching, or whatever. As indicated in the Report text, we will for clarity take these to be two strategies not used to generate the 'ideal' collection itself, though either or both can in principle be generating strategies. To compare the two strategies, we consider only that part of each output that has already been assessed; the remainder is discarded. The relative performance of the two strategies is then represented by their relative success in retrieving assessed relevant documents and rejecting assessed non-relevant ones. More specifically, the following assumptions are made about the way in which such comparative evaluation is to be conducted. We are concerned with recall and precision,* and these are interpreted as probabilities to be estimated by proportions based on samples.
That is, recall is the probability of retrieving a document given that it is relevant, and precision is the probability of a document being relevant given that it is retrieved, where these probabilities for a request and a document collection as a whole may be estimated from the proportions of relevant and non-relevant retrieved by a strategy from a proper sample of the collection which is fully characterised for relevance. To establish a significant difference in performance, over a set of requests, between strategy A and strategy B, we apply the sign test. We base it on the assumption that a percentage difference, say of 5%, between the recall or precision performance of the strategies for a single request is represented by Prob(A) - Prob(B) = 5%; and over all the requests we look for a particular significance level, say 5% or 1%, and want the test to have a particular power, say 95%. That is, an individual measurement for the application of the test is a single-request comparison between strategies A and B, so the set of measurements is the set of comparisons over the complete set of requests. We also assume that the sampling distribution for the performance measurement comparisons being considered, i.e. the differences of proportion representing recall or precision, is normal; and for convenience we assume a normal approximation to the binomial distribution for the power of the test. Finally, the overall assumption is made that the probability of strategy A being superior to strategy B is constant over the request set.

2. data

If we are thus to evaluate performance comparatively, this imposes certain requirements on the assessment data needed. The evaluation cannot begin without assessment information, so the requirements concern the amount and properties of the assessment data exploited in the application of the test.
The essential requirement is for a certain number of assessments overall; for practical reasons this can be referred to in terms of the number of requests required and the number of assessments per request, but the two are inversely related, so the total of assessments is the same. Clearly, the fundamental requirement for the whole process is that the relevance status of some of the documents retrieved by strategy A and by strategy B should be assessed. Thus it is not useful to provide assessment data in the initial collection creation by assessing a random sample of the entire collection in relation to the requests: for a large collection in particular this is likely to find no relevant documents at all. On the whole, 'real' search strategies do better than random sampling, so an effective way of seeking to ensure that some of the documents retrieved by future strategies A and B have been assessed is to provide the assessment data initially by evaluating actual search output. That is, strategy performance is evaluated by reference to assessed initial search output in order to ensure output overlap, rather than by reference to assessed randomly selected documents. It may further be sufficient to assess a sample of the initial output. However, for this use of initial search output assessments to be valid, the same requirements must apply to the search output, or any sample of it, as apply to the entire collection and any sample of this. Thus we assume, globally, that the initial output as a whole contains all the documents relevant to a request, and all the output of future searches for the request. Further, we assume that any sample drawn from the initial output is a random sample; and that any such sample is also a random sample of the output of a particular strategy.

* or related performance characterisations
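The pool construction and sampling just described can be sketched as follows. This is an illustrative Python fragment, not part of the original report: the function names, document identifiers, and the 50% sampling fraction are all invented for the example.

```python
import random

def build_pool(search_outputs):
    """Union the outputs of several alternative searches for one request."""
    pool = set()
    for output in search_outputs:
        pool.update(output)
    return pool

def assessment_sample(pool, fraction, seed=0):
    """Draw a random sample of the pool to send for relevance assessment."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(pool)))
    return set(rng.sample(sorted(pool), k))

# Three alternative searches for one request (document ids are invented):
searches = [{1, 2, 3, 4}, {3, 4, 5, 6}, {2, 6, 7}]
pool = build_pool(searches)
assessed = assessment_sample(pool, 0.5)   # the set of known relevance status
```

Only the documents in `assessed` acquire a known relevance status; a future strategy is then scored on the part of its output that falls inside this set.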
Taking the proposed evaluation procedure and data requirements together gives specific percentage samples of the initial output which must be assessed to provide adequate evaluation data for different conditions. In particular, we find that as the number of requests considered decreases, the size of sample increases (up to 100%). This data is tabulated below. Since the comprehensiveness requirements on the initial output are only likely to be satisfied in practice by combining the outputs of several alternative searches for a request, the output is referred to as the pool. The table covers different sizes of request set. The results for each set are independent of those for others: the results taken together simply show how, for different sizes of set, the number of assessments to be made as a percentage of the pool varies. For each request set, the assessment data is given for a sign test significance level of 5% or of 1% for any comparison between strategies A and B. The table then shows the critical region of the test; the number of individual measurements, i.e. request comparisons, favouring one of the strategies (say A) needed for a significant result; the probability that the measurements will favour A over the set required for 95% power in the test; and the sample size required to identify a difference between the two strategies that this implies: the sample size is the number of assessments for each of the strategies that must be provided, i.e. the extent to which the strategy output overlaps the assessed pool output. The actual formulae used in the numerical calculations are not given here: they are of an orthodox statistical nature. The second section of the table shows the percentage of the pool to be assessed for recall and for precision respectively, for given numbers of relevant documents per request, on average, and for given numbers of retrieved documents.
That is, for a reliable recall comparison between two future strategies A and B over 500 requests, say, with an average of 25 relevant documents per request in the total collection, 36% of the pool would have to be assessed for a 5% significance level in the sign test. For precision and, say, 100 documents retrieved on average, 9% must be assessed. Note that the percentage to be assessed in any given case is always higher for recall than for precision; and also that for very low numbers of requests and relevant documents, a difference at 5% or at 1% cannot be established. Note also that the figures are approximate, i.e. have not been worked to a very high level of accuracy.

Summary and tabulation

For reference, the assumptions underlying the table can be summarised as follows:

B.1 for future experiments comparing strategies A and B
  1 we evaluate using recall and precision;
  2 recall and precision are probabilities estimated by proportions based on samples;
  3 we use the sign test for validating performance differences;
  4 a percentage difference, say of 5%, between A and B, in recall or precision, is indicated by Prob_A - Prob_B = 5%;
  5 a normal sampling distribution for the difference of proportions;
  6 a normal approximation to the binomial distribution for the power of the sign test;
  7 the probability of finding A better than B is constant across requests.

B.2 for assessment data
  1 all relevant documents are contained in the pool;
  2 the output of A, and of B, is contained in the pool;
  3 a sample from the pool is a random sample;
  4 a pool random sample is also a strategy output random sample.
The situation being modelled can be illustrated thus:

[Diagram: the output of strategy A, the output of strategy B, and the relevant documents, all overlapping a random sample of the pool.]

[Table: for each size of request set, and for sign test significance levels of 5% and 1%, the critical region of the test, the number of request comparisons favouring A, the probability required for 95% power, the per-strategy sample size, and the percentage of the pool to be assessed for recall and for precision, for given average numbers of relevant and retrieved documents per request. The original tabulation is not recoverable from this copy.]

Under the null hypothesis H0 of no difference between the strategies, Prob(A > B) = Prob(B > A) = 1/2.
Since the test is based on the binomial distribution we can use the approximation (1) to find the critical region, that is, the value of the standardised normal variable which must be exceeded for H0 to be rejected at the 5% significance level. If k is the number of requests, then under H0: p = 1/2, and we get

z = (x - k/2) / (sqrt(k)/2) = (2x - k) / sqrt(k)

Using normal tables (Hoel, p. 398) we find that (2x - k)/sqrt(k) >= 1.96 gives 5% significance. This means that for k = 100 (requests) we must have at least 60 A's > B's, say.

3. The above is all we would need to be concerned with if there were no uncertainties in the probabilities we are comparing, that is, no uncertainty for precision or recall at each request. Unfortunately our decision whether A > B or B > A is based on two samples, one for A and one for B. So even if there is a real difference between A and B, because we are sampling, this difference will fluctuate. Of course, were we to take infinite (read: very large) samples, we would get the true difference. Assume now that the probabilities we are trying to estimate (recall and precision) are constant across requests; we can then calculate a minimum sample size for each request (it will be the same for all) necessary for the sign test to show a significant difference. To do this we must assume what the real difference is. Obviously, the bigger the real difference, the smaller the sample size necessary to reflect it. There is a sampling theorem for differences (see Hoel, p. 149) which again allows us to use the normal approximation to the binomial. The effect of using the theorem is, for us, the calculation of P(x_A > x_B) for any given n (sample size). Conversely, given P(x_A > x_B), we can calculate the n necessary to achieve it. Once we have done this, the constancy across requests will tell us the expected number of requests with A > B.
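As a numerical check on the figures quoted in this appendix (the critical value of 60 wins for k = 100, and the alternative hypotheses p = .60 and p = .68), the binomial tail probabilities can be computed exactly rather than by the normal approximation used in the report. This check is ours, not the report's.

```python
# Exact binomial upper-tail probability, P(X >= x_c) for X ~ Bin(k, p).
from math import comb

def tail(k, p, x_c):
    return sum(comb(k, x) * p**x * (1 - p)**(k - x)
               for x in range(x_c, k + 1))

# With k = 100 requests and a critical value of 60 A-wins:
print(tail(100, 0.50, 60))  # chance of a spurious significant result under H0
print(tail(100, 0.60, 60))  # roughly a 50% chance of significance
print(tail(100, 0.68, 60))  # roughly a 95% chance of significance
```

The exact tails confirm the approximate design: at p = .60 the test reaches significance about half the time, while at p = .68 it does so about 95% of the time, so a per-request sample size large enough to push the win probability to about .68 is what 95% power requires.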
Conversely, given the number of A's > B's dictated by the sign test and letting it equal the expected number derived above, we can choose a sample size to achieve the expected number. Because we design for an expected number, it is reasonable to assume that 50% of the time the number of A's > B's will fall below the critical value and 50% of the time above. But we would like a higher chance of significance, or, to put it another way, a higher chance of rejecting the null hypothesis if it is in fact false (i.e. P_A - P_B = 5% is true). This can only be done by increasing P(x_A > x_B) (or, equivalently, increasing the sample size). We want to ensure a 95% chance, when P_A - P_B = 5%, that the number of A's > B's will exceed the critical value. In other words, for what value of P(x_A > x_B) will it be the case that there is a 95% chance of significance? This we again get by using the normal approximation to the binomial. We may illustrate the relationship between the critical region defined by x > x_c and a 50% or 95% power of the test by the following very crude diagram:

[Diagram: two sampling curves for the number of successes, one giving a 50% and one a 95% probability that x > x_c. For k = 100, x_c = 60; H0 (p = 1/2); H1' (p = .60) gives a 50% chance of 5% significance; H1'' (p = .68) gives a 95% chance of 5% significance.]

Comments:

a) Once we have the sample size we can use it to calculate the percentage of the pool. The basic idea is that we want a random sample from the future outputs and relevant documents big enough to estimate precision and recall. For this we need assumptions 2.1 and 2.2 of section B above.

b) The table given earlier shows a number of alternatives. One can do with fewer requests by increasing the number of assessments per request.

c) The sign test could be replaced by a stronger test, in which case the design would be somewhat cheaper.

P.G. Hoel, Introduction to Mathematical Statistics, 3rd Ed., Wiley, 1962.

Appendix 8 : Research project questionnaire
POSSIBLE RESEARCH PROJECT USING THE 'IDEAL' TEST COLLECTION

The 'ideal' retrieval test collection is intended to permit a variety of controlled indexing and retrieval experiments on real material, to encourage inter-project comparisons, and to reduce data preparation effort. It would consist essentially of a large set of basic document descriptions, from which different subsets with particular properties and fuller descriptions could be drawn; of off-line and on-line queries; and of associated relevance judgements. The collection would be set up in a well-organised way, and would be available in machine-readable form. The first specification for the collection is given in K. Sparck Jones and C.J. van Rijsbergen, "Report on the Need for and Provision of an 'Ideal' Information Retrieval Test Collection", 1975; a more detailed one is provided by K. Sparck Jones, "Outline Specification for the 'Ideal' Information Retrieval Test Collection", 1976, both available from K. Sparck Jones.

Project topic

Objective

Methodology

Data requirements
a) content
b) form (machine/manual)

Scale
a) time: 1, 2, 3, or more years
b) man-power: 1-2, 3-4, 5-6, or more staff

Status
would like to start as soon as material is available (if not, is this because of other commitments, or because the project is tentative)

Name
Address

Appendix 9 : Teaching and on-line education questionnaires

INFORMATION RETRIEVAL TEST COLLECTION: USE FOR TEACHING AND RESEARCH IN DEPARTMENTS OF COMPUTING, INFORMATION STUDIES, OR LIBRARIANSHIP

1 a) Topics under the general headings of information or data management, processing or retrieval, of interest to your department:

b) Topics specifically studied in courses:

2 General data requirements, e.g.
type and volume of material:
for 1 a):
for 1 b):

3 Levels of study, and numbers of students involved, in information processing:
undergraduate, 3 years :
               2 years :
               1 year  :
postgraduate,  diploma :
        master's degree :
        doctor's degree :

Name
Department
Address

THE 'IDEAL' INFORMATION RETRIEVAL TEST COLLECTION : POSSIBLE USE IN CONNECTION WITH ON-LINE EDUCATION

1 Do you, or are you intending to, teach on-line searching?

2 If so, do you think that such data as that contained in the proposed test collection, if set up on a convenient computer, could be of value for your teaching activities?

3 Have you any special requirements in mind?

4 Would you expect or like to be able to use a local computer, or have to rely on remote access?

5 Number of students likely to be involved:
a) undergraduate
b) postgraduate

Name
Department
Address