Appendix 1 : List of reference works consulted

British Library Research and Development Department. Inventory of Bibliographic Data Bases Produced in the U.K. BLR&DD Report No. 5256, British Library, London, 1976.

Hall, J.E. On-Line Information Retrieval 1965-1976: A Bibliography with a Guide to On-Line Data Bases and Systems. Aslib Bibliography No. 8, Aslib, London, 1977.

Leigh, J.A. Guide to Computer-Based Literature Searching Services in Science and Technology available in the U.K. Science Reference Library, British Library, London, 1976.

Thomas, A. (Ed.) London University Central Information Services (LUCIS) Guide to Computer-Based Information Services. 2nd Ed., Central Information Services, University of London, 1977.

Tomberg, A. (Ed.) Data Bases in Europe: A Directory to Machine-Readable Data Bases and Data Banks in Europe. 2nd Ed., Aslib & Eusidic, London, 1976.

Williams, M.E. and Rouse, S.H. Computer Readable Bibliographic Data Bases: A Directory and Data Source. ASIS, Washington D.C., 1976.

Appendix 2 : Sample entry from Williams and Rouse's Data Base Directory

STI

1. BASIC INFORMATION
NAME OF DATA BASE
  ACRONYM/SHORT NAME: STI
  FULL NAME: Specialized Textile Information Service
FREQUENCY OF UPDATE: bimonthly
NUMBER OF TAPES ISSUED PER YEAR: 24
TIME SPAN COVERED BY DATA BASE: 01/70 to present
CORRESPONDENCE WITH PRINTED SOURCE: World Textile Abstracts
FEWER REFERENCES ON TAPE THAN PRINTED SOURCE: yes (see Introduction section 4.2)

2. PRODUCER/DISTRIBUTOR/GENERATOR INFORMATION
PRODUCER OF DATA BASE
  NAME: Shirley Institute, Manchester M20 8RX, England
  PERSON TO CONTACT RE. INFORMATION ABOUT TAPES: Mr. R. J. E. Cumberbirch
  (NOTE: Four research institutions collaborate in covering the literature for STI: British Launderer's Research Association (covers all aspects of laundering and dry cleaning); Hatra (covers all aspects of knitting and making-up); Shirley Institute (covers all fibres other than wool and hair, and their properties and processing other than knitting, including lacemaking, knotting and braiding, and bonding, needling and tufting); Wira (covers wool and hair and their properties and processing other than knitting).)
DISTRIBUTOR OF DATA BASE
  NAME: Shirley Institute, Manchester M20 8RX, England
  PERSON TO CONTACT RE. DISTRIBUTION OF TAPES: Mr. R. J. E. Cumberbirch
GENERATOR OF (PHYSICAL) DATA BASE
  NAME: Shirley Institute, Manchester M20 8RX, England
  PERSON TO CONTACT RE. TAPE FORMAT, SOFTWARE DATA: Dr. K. C. Ellis

3. AVAILABILITY AND CHARGES FOR DATA BASE TAPES
CURRENT FILES: 1975, 24 bimonthly issues
  RESTRICTIONS: ownership of the data base remains vested in the Shirley Institute
  LEASE: $1100.00 base fee plus $260.00 for cost of tapes in the STI Service
BACK FILES: 1970-1974, annual issues
  RESTRICTIONS: ownership of the data base remains vested in the Shirley Institute
  LEASE: $1100.00 base fee per annual issue in the STI Service, plus $250.00 for cost of tapes and air mail postage
SAMPLE TAPES: no charge to bona fide potential subscriber

4. SUBJECT-MATTER AND SCOPE OF DATA ON TAPE
SUBJECT MATTER AND SCOPE: Covers the literature of permanent technical value on the science and technology of textiles plus all relevant UK and US patent literature.
SUBJECT CATEGORY: Chemistry and Chemical Engineering; Patents; Textiles
TARGET USER COMMUNITY: Research and industry
ANTICIPATED GROWTH RATE (AVG. NO. OF SOURCE ITEMS ADDED PER YEAR): 8,000
BIBLIOGRAPHIC DATA BASE SOURCE ITEMS CAN BE APPROXIMATED AS:
  40% Journal articles. Of these, 50% are published in English. No. of journals from which selected articles are entered: 500
  0% Government reports/documents
  40% Patents. Of these, 50% are U.S.A. patents
  0% Monographs, published proceedings, theses, etc.
  0% Preprints, papers presented at conferences
  0% Manufacturers' catalogs
  0% News items from releases, press reports, broadcasts, etc.
  20% Other: manufacturers' technical publications; government reports/documents; preprints; monographs, published proceedings, theses, etc.
  100% Total

5. SUBJECT ANALYSIS/INDEXING DATA
Controlled keywords (from thesaurus). Avg. no. terms/document: 10
Chemical identifiers (nomenclature codes, notations, fragmentation schemes): trade name(s)

6. BIBLIOGRAPHIC DATA BASE ELEMENTS PRESENT ON TAPE
Author(s); author address; editor(s); editor address; corporate author(s); corporate author address; title of item (original lang., translat., translit.); title of source item (journal, conf. proc.); bibliographic reference (volume, issue); page(s), inclusive or total; date (publication date of item, dates for patents); publisher; place of publication; cited references by source item: total no.; patent information; language (of item); indication of type of item (e.g. jnl. art., mono., govt. doc., etc.); treatment code or level of approach (e.g. review, app'n., theory, etc.); item accession number, unique id.
(NOTE: The reference given for patents consists of (1) the patent number, (2) the publication date, and (3) the application date and number in the country issuing the patent, or, if there is a prior date of application (the convention date), the name of the country and the number.)

7. TAPE SPECIFICATIONS
CODE: BCD
CHARACTER SET: upper and lower case
DENSITY (BPI): 556
NUMBER OF TRACKS: 7; 9
LABELS: not present
RECORD FORMAT: fixed, blocked
NUMBER BYTES/BLOCK: 4,096 or 16,384
NUMBER BITS/BYTE: 6

8. SEARCH PROGRAMS

9. DATA BASE SERVICES OFFERED (Brokers not listed.
See Introduction section 4.9)
TRANSLATION SERVICES AVAILABLE FROM: producer
DOCUMENT DELIVERY, REPROGRAPHIC SERVICES AVAILABLE FROM: producer

10. USER AIDS OFFERED BY DATA BASE PRODUCER
VOCABULARY/TERM LIST, THESAURUS: STI Keyterm List. An approved list of keyterms that shows the relationship of each term to other keyterms. AVAILABLE IN: hardcopy; PRICE: available free of charge to data base subscribers; non-subscribers $17.00 for both keyterm lists.
Advisory Lists of Related Keyterms. AVAILABLE IN: hardcopy; PRICE: available free of charge to data base subscribers; non-subscribers $17.00 for both keyterm lists.
DATA BASE TAPE DOCUMENTATION: World Textile Abstracts Service and Specialized Textile Information Service, Manual for Abstracts, January 1975. Describes the coverage, subject indexing, production of tapes, and data base format and data elements. AVAILABLE IN: hardcopy; PRICE: available on request.

Appendix 3 : Example of data base questionnaire as sent out

(The copy reproduced is the questionnaire completed for the Materials data base. Each question carries its Williams code; replies are shown where they are legible in this copy and omitted where they are not.)

SECTION 1 NATURE OF DATA BASE

BASIC INFORMATION
010.0 Name of data base : Materials
030.0 Frequency of update : Biweekly
040.0 Time span covered : Jan '75 to present
045.0 First available in machine-readable form : Jan '75
060.1 If subset data base, name of parent
075.0 Related machine-readable files : None
080.0, 085.1 Corresponding printed compilation
090.1-090.5 Same/fewer/more references on tape than compilation

PRODUCER ETC. INFORMATION
110.0-110.6 Producer organisation and address : Chemical Abstracts Service, The Ohio State University, Columbus OH 43210
130.0-130.6 Person to contact : Marketing Department
150.0-150.6 Distributor organisation, in U.K., and address : United Kingdom Chemical Information Service, The University, University Park, Nottingham
Person to contact : Dr. A. Kabi
Generator (of physical data base) organisation and address : United Kingdom Chemical Information Service, The University, University Park, Nottingham
Person to contact : Dr. A. Kabi

SUBJECT, SCOPE INFORMATION
310.0 Subject matter and scope : Chemical and chemical engineering aspects of the production, properties and applications of industrially important materials.
320.0 Subject category : Chemistry & Chemical Engineering; Mining; Metallurgy.
340.0 Approx. number source items by December 1976
350.0 Average number items added per year
360.1 Percent journal articles
360.11 Percent of these in English
360.12 Number of journals from which all articles taken
360.13 Number of journals from which some articles taken
360.14 Approx. number of journal titles reviewed for input
360.2 Percent government reports, documents
360.3 Percent patents
360.31 Percent of these which are U.K.
360.4 Percent monographs, theses, conference proceedings, etc.
360.5 Percent preprints, conference papers, etc.
360.6 Percent non-government reports, documents
360.7 Percent manufacturers' catalogues
360.8 Percent news items, etc.
360.81 Percent other; description of other
360.9 Percent total (100%)
Percent material not in English
How much per item translated to English
(The numerical replies to 340.0-360.9 are not reliably legible in this copy.)

INDEXING INFORMATION
No special indexing
410.0 Enriched titles
415.0, 415.1 Average number added terms per title : 0
420.0 Uncontrolled (natural language) keywords : Yes (patents only)
420.1 Average number of keywords per document : 2
May these be word strings or only single words : phrases of approx. 4 words
425.0 Controlled (thesaurus) terms; thesaurus name
425.1 Average number of terms per document
430.0 Subject headings; subject heading system name; average number of headings per document
435.0-435.2 Subject codes; subject code system name; average number of codes per document
Descriptive phrase or sentence
Any other indexing
Indexing source
Are chemical identifiers used; are these in a specified record field; average number per document; percent data base having them

BIBLIOGRAPHIC INFORMATION
No bibliographic information
Author(s); author address
Editor(s); editor address
Corporate author(s); corporate author address
Title of item (indicated as original, translation, transliteration)
Title of source
Bibliographic reference (volume, issue)
Pages, specified or total
Publication date
Publisher
Place of publication
References cited by source, in total or details
Standard bibliographic codes, CODEN, ISSN/ISBN, other : CODEN
Abstract; short digest
Patent information
Report number; language
Indication of type of item (e.g. article, monograph, etc.)
Treatment code or level of approach
Item accession or other unique identifying number
Price
(The Yes/No replies in this section are only partly legible in this copy.)

SECTION 2 USE OF DATA BASE

If you run a search service on your data base, please complete Section 2. If you only supply the data to search services run by others, please complete Section 3. (If you both run your own service and supply others, please complete both sections.)

Code
1010.0 Data base only searchable via abstract journal, printed index, etc.
1020.0 Retrospective off-line searching available
1025.0 SDI searching available
1025.1 Time period for SDI
1030.0 On-line searching available
1030.1 All or part of data base available on-line
1040.0 Approx. number of searches per month, altogether
1040.1 Approx. number off-line searches
1040.2 Approx. number SDI searches
1040.3 Approx. number on-line searches
1050.0 Approx. number subscribers altogether
1050.1 Approx. number individual users represented, altogether
1050.2 Approx. number off-line users
1050.3 Approx. number SDI users
1050.4 Approx. number on-line users
1060.0 Indexing fields available for searching
1070.0 Bibliographic fields available for searching
1080.0 Searching by Boolean logic
1080.1 Searching by simple coordination
1080.2 Searching with term weights
1080.3 Arbitrary term truncation
1080.4 Other search methods
1080.5 Search formulation and searching by user or intermediary
1090.0 Person to contact about search service

SECTION 3 SUPPLY OF DATA BASE

1110.0 UK search services to whom data base supplied (name, address, person to contact)
1120.0 Is data base available on Lockheed's DIALOG system

Signed                    Date

Appendix 4 : List of CA and CAB subbases

CA subbases (United Kingdom Chemical Information Service)
CACon : CA CONDENSATES
CBAC : CHEMICAL-BIOLOGICAL ACTIVITIES
CIN : CHEMICAL INDUSTRY NOTES
CT : CHEMICAL TITLES
ECOLOGY AND ENVIRONMENT
ENERGY
FOOD AND AGRICULTURAL CHEMISTRY
MATERIALS
POST : POLYMER SCIENCE AND TECHNOLOGY

CAB subbases (Commonwealth Agricultural Bureaux)
Animal Breeding Abstracts
Apicultural Abstracts
Dairy Science Abstracts
Field Crop Abstracts
Forestry Abstracts
Helminthological Abstracts
Herbage Abstracts
Horticultural Abstracts
Index Veterinarius
Nutrition Abstracts and Reviews
Plant Breeding Abstracts
Review of Applied Entomology
Review of Medical and Veterinary Mycology
Review of Plant Pathology
Soils and Fertilisers
Veterinary Bulletin
Weed Abstracts
World Agricultural Economics and Rural Sociology Abstracts

Appendix 5 : Tabulated data base questionnaire replies
[The tabulation repeats the Appendix 3 questionnaire questions (Sections 1-3, with their Williams codes) as rows, with one column of replies per data base: the CA subbases, the CAB subbases, and other services of Appendix 4. The handwritten entries in these tables are not legible in this copy.]
Appendix 6 : CAB subbase sizes 1976/7

[The table gives, for each CAB subbase listed in Appendix 4, its 1976/7 size and its 1977 increase, both in thousands of records; several subbases are combined on single tapes, marked (1)-(3). The alignment of figures to titles is not recoverable from this copy.]

Appendix 7 : Analysis of relevance judgement requirements

This appendix provides the argument for the number and nature of relevance assessments for the 'ideal' collection. This is initially presented in a very elementary form.
A summary of the assumptions made, and a tabulation of the numbers of assessments required in different circumstances, follow. Some implications of the approach are then discussed. In the last section an alternative presentation in more conventional statistical language is provided.

A. Elementary presentation

The essential object of our calculations is to ensure that adequate relevance information is collected for the evaluation of future experimental results, in the case where exhaustive relevance assessment is impossible. In the past, test data has either been 'globally' exhaustive, in the sense that the entire collection is assessed for the test requests, so that the status of any document retrieved by a new strategy, i.e. indexing or searching device or procedure, is known; or 'locally' exhaustive, in that some or all of the output of the particular strategies being considered is assessed, so that the performance of these strategies can be compared with respect to the combined assessed output for the strategies. The problem encountered in considering relevance assessment for the 'ideal' collection is that while global exhaustion is not possible, local exhaustion as conventionally defined cannot be used for future strategies, since these may produce output not related in a well-defined way to the initial output for which assessments are provided: i.e. the new output is neither included in the assessed output nor overlapped with it in a coherent way. And if an attempt is made to meet this difficulty of local exhaustion by making the initial searches so broad that their output is likely to be exhaustive of future output, this appears to imply that an unacceptably large number of assessments have to be made. The question is therefore whether the initial output can be obtained and assessed, at the time when the 'ideal' collection is set up, in such a way that future experimental output can be properly evaluated.
Essentially, our argument is that under suitable conditions this can be achieved by sampling from the initial output: that is, in the collection building we conduct searches for the given requests (i.e. based on the given need statements), probably a variety of alternative searches for each request, and establish a pool of retrieved documents for each request. From this pool a sample is drawn for assessment. This sample constitutes the set of documents of known relevance status which is used to characterise, and more importantly to compare, performance for new strategies. Our argument has two components: it covers, first, the way in which future experiments are to be conducted, i.e. comparative evaluation is to be carried out; and second, the characteristics of the relevance data needed to support this evaluation methodology.

1. evaluation

The object of a retrieval test, at the lowest level, is taken to be a comparison between two strategies, A and B, representing different choices of indexing, searching, or whatever. As indicated in the Report text, we will for clarity take these to be two strategies not used to generate the 'ideal' collection itself, though either or both can in principle be generating strategies. To compare the two strategies, we consider only that part of each output that has already been assessed; the remainder is discarded. The relative performance of the two strategies is then represented by their relative success in retrieving assessed relevant documents and rejecting assessed non-relevant ones. More specifically, the following assumptions are made about the way in which such comparative evaluation is to be conducted. We are concerned with recall and precision,* and these are interpreted as probabilities to be estimated by proportions based on samples.
That is, recall is the probability of retrieving a document given that it is relevant, and precision is the probability of a document being relevant given that it is retrieved, where these probabilities for a request and a document collection as a whole may be estimated from the proportions of relevant and non-relevant retrieved by a strategy from a proper sample of the collection which is fully characterised for relevance. To establish a significant difference in performance, over a set of requests, between strategy A and strategy B, we apply the sign test. We base it on the assumption that a percentage difference, say of 5%, between the recall or precision performance of the strategies for a single request is represented by Prob(A) - Prob(B) = 5%; and over all the requests we look for a particular significance level, say 5% or 1%, and want the test to have a particular power, say 95%. That is, an individual measurement for the application of the test is a single-request comparison between strategies A and B, so the set of measurements is the set of comparisons over the complete set of requests. We also assume that the sampling distribution for the performance measurement comparisons being considered, i.e. the differences of proportion representing recall or precision, is normal; and for convenience we assume a normal approximation to the binomial distribution for the power of the test. Finally, the overall assumption is made that the probability of strategy A being superior to strategy B is constant over the request set.

2. data

If we are thus to evaluate performance comparatively, this imposes certain requirements on the assessment data needed. The evaluation cannot begin without assessment information, so the requirements concern the amount and properties of the assessment data exploited in the application of the test.
The essential requirement is for a certain number of assessments overall; for practical reasons this can be referred to in terms of the number of requests required and the number of assessments per request, but the two are inversely related, so the total of assessments is the same. Clearly, the fundamental requirement for the whole process is that the relevance status of some of the documents retrieved by strategy A and by strategy B should be assessed. Thus it is not useful to provide assessment data in the initial collection creation by assessing a random sample of the entire collection in relation to the requests: for a large collection in particular this is likely to find no relevant documents at all. On the whole, 'real' search strategies do better than random sampling, so an effective way of seeking to ensure that some of the documents retrieved by future strategies A and B have been assessed is to provide the assessment data initially by evaluating actual search output. That is, strategy performance is evaluated by reference to assessed initial search output in order to ensure output overlap, rather than by reference to assessed randomly selected documents. It may further be sufficient to assess a sample of the initial output. However, for this use of initial search output assessments to be valid, the same requirements must apply to the search output, or any sample of it, as apply to the entire collection and any sample of this. Thus we assume, globally, that the initial output as a whole contains all the documents relevant to a request, and all the output of future searches for the request. Further, we assume that any sample drawn from the initial output is a random sample; and that any such sample is also a random sample of the output of a particular strategy.

* or related performance characterisations
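The pool construction and sampling just described can be sketched as follows. This is an illustrative Python fragment, not part of the original report: the function names, document identifiers, and the 50% sampling fraction are all invented for the example.

```python
import random

def build_pool(search_outputs):
    """Union the outputs of several alternative searches for one request."""
    pool = set()
    for output in search_outputs:
        pool.update(output)
    return pool

def assessment_sample(pool, fraction, seed=0):
    """Draw a random sample of the pool to send for relevance assessment."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(pool)))
    return set(rng.sample(sorted(pool), k))

# Three alternative searches for one request (document ids are invented):
searches = [{1, 2, 3, 4}, {3, 4, 5, 6}, {2, 6, 7}]
pool = build_pool(searches)
assessed = assessment_sample(pool, 0.5)   # the set of known relevance status
```

Only the documents in `assessed` acquire a known relevance status; a future strategy is then scored on the part of its output that falls inside this set.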
Taking the proposed evaluation procedure and data requirements together gives specific percentage samples of the initial output which must be assessed to provide adequate evaluation data for different conditions. In particular, we find that as the number of requests considered decreases, the size of sample increases (up to 100%). This data is tabulated below. Since the comprehensiveness requirements on the initial output are only likely to be satisfied in practice by combining the outputs of several alternative searches for a request, the output is referred to as the pool. The table covers different sizes of request set. The results for each set are independent of those for others: the results taken together simply show how, for different sizes of set, the number of assessments to be made as a percentage of the pool varies. For each request set, the assessment data is given for a sign test significance level of 5% or of 1% for any comparison between strategies A and B. The table then shows the critical region of the test; the number of individual measurements, i.e. request comparisons, favouring one of the strategies (say A) needed for a significant result; the probability that the measurements will favour A over the set required for 95% power in the test; and the sample size required to identify a difference between the two strategies that this implies: the sample size is the number of assessments for each of the strategies that must be provided, i.e. the extent to which the strategy output overlaps the assessed pool output. The actual formulae used in the numerical calculations are not given here: they are of an orthodox statistical nature. The second section of the table shows the percentage of the pool to be assessed for recall and for precision respectively, for given numbers of relevant documents per request, on average, and for given numbers of retrieved documents.
That is, for a reliable recall comparison between two future strategies A and B over 500 requests, say, with an average of 25 relevant documents per request in the total collection, 36% of the pool would have to be assessed for a 5% significance level in the sign test. For precision and, say, 100 documents retrieved on average, 9% must be assessed. Note that the percentage to be assessed in any given case is always higher for recall than for precision; and also that for very low numbers of requests and relevant documents, a difference at 5% or at 1% cannot be established. Note also that the figures are approximate, i.e. have not been worked to a very high level of accuracy.

Summary and tabulation

For reference, the assumptions underlying the table can be summarised as follows:

B.1 for future experiments comparing strategies A and B
  1 we evaluate using recall and precision;
  2 recall and precision are probabilities estimated by proportions based on samples;
  3 we use the sign test for validating performance differences;
  4 a percentage difference, say of 5%, between A and B, in recall or precision, is indicated by Prob_A - Prob_B = 5%;
  5 a normal sampling distribution for the difference of proportions;
  6 a normal approximation to the binomial distribution for the power of the sign test;
  7 the probability of finding A better than B is constant across requests.

B.2 for assessment data
  1 all relevant documents are contained in the pool;
  2 the output of A, and of B, is contained in the pool;
  3 a sample from the pool is a random sample;
  4 a pool random sample is also a strategy output random sample.
The situation being modelled can be illustrated thus:

[Diagram: the output of strategy A, the output of strategy B, and the relevant documents, all overlapping a random sample of the pool.]

[Table: for each size of request set, and for sign test significance levels of 5% and 1%, the critical region of the test, the number of request comparisons favouring A, the probability required for 95% power, the per-strategy sample size, and the percentage of the pool to be assessed for recall and for precision, for given average numbers of relevant and retrieved documents per request. The original tabulation is not recoverable from this copy.]

Under the null hypothesis H0 of no difference between the strategies, Prob(A > B) = Prob(B > A) = 1/2.
Since the test is based on the binomial distribution we can use the approximation (1) to find the critical region, that is, the value of the standardised normal variable which must be exceeded for H0 to be rejected at the 5% significance level. If k is the number of requests, then under H0: p = 1/2, and we get

z = (x - k/2) / (sqrt(k)/2) = (2x - k) / sqrt(k)

Using normal tables (Hoel, p. 398) we find that (2x - k)/sqrt(k) >= 1.96 gives 5% significance. This means that for k = 100 (requests) we must have at least 60 A's > B's, say.

3. The above is all we would need to be concerned with if there were no uncertainties in the probabilities we are comparing, that is, no uncertainty for precision or recall at each request. Unfortunately our decision whether A > B or B > A is based on two samples, one for A and one for B. So even if there is a real difference between A and B, because we are sampling, this difference will fluctuate. Of course, were we to take infinite (read: very large) samples, we would get the true difference. Assume now that the probabilities we are trying to estimate (recall and precision) are constant across requests; we can then calculate a minimum sample size for each request (it will be the same for all) necessary for the sign test to show a significant difference. To do this we must assume what the real difference is. Obviously, the bigger the real difference, the smaller the sample size necessary to reflect it. There is a sampling theorem for differences (see Hoel, p. 149) which again allows us to use the normal approximation to the binomial. The effect of using the theorem is, for us, the calculation of P(x_A > x_B) for any given n (sample size). Conversely, given P(x_A > x_B), we can calculate the n necessary to achieve it. Once we have done this, the constancy across requests will tell us the expected number of requests with A > B.
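As a numerical check on the figures quoted in this appendix (the critical value of 60 wins for k = 100, and the alternative hypotheses p = .60 and p = .68), the binomial tail probabilities can be computed exactly rather than by the normal approximation used in the report. This check is ours, not the report's.

```python
# Exact binomial upper-tail probability, P(X >= x_c) for X ~ Bin(k, p).
from math import comb

def tail(k, p, x_c):
    return sum(comb(k, x) * p**x * (1 - p)**(k - x)
               for x in range(x_c, k + 1))

# With k = 100 requests and a critical value of 60 A-wins:
print(tail(100, 0.50, 60))  # chance of a spurious significant result under H0
print(tail(100, 0.60, 60))  # roughly a 50% chance of significance
print(tail(100, 0.68, 60))  # roughly a 95% chance of significance
```

The exact tails confirm the approximate design: at p = .60 the test reaches significance about half the time, while at p = .68 it does so about 95% of the time, so a per-request sample size large enough to push the win probability to about .68 is what 95% power requires.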
Conversely, given the number of A's > B's dictated by the sign test and letting it equal the expected number derived above, we can choose a sample size to achieve the expected number. Because we design for an expected number, it is reasonable to assume that 50% of the time the number of A's > B's will fall below the critical value and 50% of the time above. But we would like a higher chance of significance, or, to put it another way, a higher chance of rejecting the null hypothesis if it is in fact false (i.e. P_A - P_B = 5% is true). This can only be done by increasing P(x_A > x_B) (or, equivalently, increasing the sample size). We want to ensure a 95% chance, when P_A - P_B = 5%, that the number of A's > B's will exceed the critical value. In other words, for what value of P(x_A > x_B) will it be the case that there is a 95% chance of significance? This we again get by using the normal approximation to the binomial. We may illustrate the relationship between the critical region defined by x > x_c and a 50% or 95% power of the test by the following very crude diagram:

[Diagram: two sampling curves for the number of successes, one giving a 50% and one a 95% probability that x > x_c. For k = 100, x_c = 60; H0 (p = 1/2); H1' (p = .60) gives a 50% chance of 5% significance; H1'' (p = .68) gives a 95% chance of 5% significance.]

Comments:

a) Once we have the sample size we can use it to calculate the percentage of the pool. The basic idea is that we want a random sample from the future outputs and relevant documents big enough to estimate precision and recall. For this we need assumptions 2.1 and 2.2 of section B above.

b) The table given earlier shows a number of alternatives. One can do with fewer requests by increasing the number of assessments per request.

c) The sign test could be replaced by a stronger test, in which case the design would be somewhat cheaper.

P.G. Hoel, Introduction to Mathematical Statistics, 3rd Ed., Wiley, 1962.

Appendix 8 : Research project questionnaire
POSSIBLE RESEARCH PROJECT USING THE 'IDEAL' TEST COLLECTION

The 'ideal' retrieval test collection is intended to permit a variety of controlled indexing and retrieval experiments on real material, to encourage inter-project comparisons, and to reduce data preparation effort. It would consist essentially of a large set of basic document descriptions, from which different subsets with particular properties and fuller descriptions could be drawn; of off-line and on-line queries; and of associated relevance judgements. The collection would be set up in a well-organised way, and would be available in machine-readable form. The first specification for the collection is given in K. Sparck Jones and C.J. van Rijsbergen, "Report on the Need for and Provision of an 'Ideal' Information Retrieval Test Collection", 1975; a more detailed one is provided by K. Sparck Jones, "Outline Specification for the 'Ideal' Information Retrieval Test Collection", 1976, both available from K. Sparck Jones.

Project topic

Objective

Methodology

Data requirements
a) content
b) form (machine/manual)

Scale
a) time: 1, 2, 3, or more years
b) man-power: 1-2, 3-4, 5-6, or more staff

Status
would like to start as soon as material is available (if not, is this because of other commitments, or because the project is tentative)

Name
Address

Appendix 9 : Teaching and on-line education questionnaires

INFORMATION RETRIEVAL TEST COLLECTION: USE FOR TEACHING AND RESEARCH IN DEPARTMENTS OF COMPUTING, INFORMATION STUDIES, OR LIBRARIANSHIP

1 a) Topics under the general headings of information or data management, processing or retrieval, of interest to your department:

b) Topics specifically studied in courses:

2 General data requirements, e.g.
type and volume of material:
for 1 a):
for 1 b):

3 Levels of study, and numbers of students involved, in information processing:
undergraduate, 3 years :
               2 years :
               1 year  :
postgraduate,  diploma :
        master's degree :
        doctor's degree :

Name
Department
Address

THE 'IDEAL' INFORMATION RETRIEVAL TEST COLLECTION : POSSIBLE USE IN CONNECTION WITH ON-LINE EDUCATION

1 Do you, or are you intending to, teach on-line searching?

2 If so, do you think that such data as that contained in the proposed test collection, if set up on a convenient computer, could be of value for your teaching activities?

3 Have you any special requirements in mind?

4 Would you expect or like to be able to use a local computer, or have to rely on remote access?

5 Number of students likely to be involved:
a) undergraduate
b) postgraduate

Name
Department
Address