AUTOMATIC INDEXING 1974

A State of the Art Review

KAREN SPARCK JONES

April 1974

Computer Laboratory, University of Cambridge, Corn Exchange Street, Cambridge CB2 3QG, England

This review was prepared under Grant No. SI/G/096 from the Office for Scientific and Technical Information, Department of Education and Science (now the British Library Research and Development Department).

The review was discussed at a Workshop held at Crawley, Sussex on April 29 and 30 1974. This was sponsored by the British Library Research and Development Division, and was organised by the University of Kent, Canterbury. A Report on the Workshop by Miss E. Wilson, Computing Laboratory, University of Kent, is in preparation.

C O N T E N T S

Preface : terminology etc.

I     Introduction
  .1  The problem
  .2  Evaluation
  .3  Information retrieval: historical and general background
  .4  Related areas

II    Automatic Indexing Procedures
  .1  Semantics and syntax
  .2  Secondary system factors

III   Syntax
  .1  Input syntax
  .2  Description syntax
  .3  Index language syntax
  .4  Search syntax
  .5  Conclusion on syntactic indexing

IV    Semantics
  .1  Statistical semantics
  .2  Input semantics
  .3  Description semantics
  .4  Index language semantics
  .5  Search semantics
  .6  Conclusion on semantic indexing

V     Indirect Indexing
  .1  Citations
  .2  Document clustering

VI    Mechanised Systems
  .1  Standard systems
  .2  Non-standard systems

VII   Related Areas
  .1  Automatic abstracting and extracting
  .2  Question answering (fact retrieval)

VIII  Evaluation Experiments
  .1  Review of experiments
  .2  The SMART Project

IX    Conclusion
  .1  Overview
  .2  Recommendations

P R E F A C E

Terminology

Information retrieval : = document retrieval = reference retrieval = documentation (= retrieval).

Fact retrieval : (alias question answering) direct extraction of facts from an information store (or obtaining of answers therefrom).

Indexing : provision of descriptions of documents for retrieval purposes.

Extracting : provision of extracts of document texts for general purposes.

Abstracting : provision of summaries of documents for general purposes.

Automatic indexing : strictly, automatic provision of document descriptions for retrieval. But in this review extended to cover all the linguistic operations involved in analysing, describing and searching for documents, and in creating the index language required.

Analysis : of documents to select information required for indexing.

Description (1) : of documents by characterising selected information in an index language.

Searching : of file of document descriptions to find descriptions meeting a request specification.

Natural language : the language of documents, their titles and abstracts, and of requests.

Index language : the language used to describe documents and requests, which may be more or less artificial.

Document : specifically the actual articles in a library, but more generally the form of the document input to indexing, whether full text, title or abstract.

Surrogate : (alias representative) the input form of the document when this is not its full text.

Description (2) : the index language characterisation of a document.

Unit : in analysis or indexing, any word or sequence of words treated as a whole.

Word : a word in natural language.

Keyword : a word or word stem from the input natural language adopted as an index term.

Term : a word or word group in the index language used to describe documents; a simple rather than complex item.

Subject heading : also a word or word group in the index language; but normally a complex or compound item.

Descriptor : any unit in an index description, whether a keyword, term, subject heading or class label.

Precoordinate : as ordinarily used.

Postcoordinate : ditto.

Vocabulary : the set of descriptors of an indexing language.

Classification : any grouping of items, but ordinarily applied to descriptors.

Clustering : classification applied to documents.

Syntax : (alias syntagmatic) refers to relations between words or descriptors not constant in the language.

Semantics : (alias paradigmatic) refers to constant relations between words or descriptors; also simply to their own meaning.

Request : retrieval question.

Collection : in general, the set of documents in a retrieval system; but specifically, the set of documents and requests used for a retrieval experiment.

Recall : the ratio between relevant documents retrieved and all the relevant documents in a collection, for a request or set of requests.

Precision : the ratio between relevant documents retrieved and all the documents retrieved, for a request or request set.

Pullout : extent to which relevant documents are retrieved (in a general, not precisely measured way).

Selectivity : extent to which relevant documents only are retrieved (ditto).

Performance : of a retrieval system, measured in appropriate ways, for example by recall and precision.

Noticeable : interesting (as well as statistically significant) difference in performance.

Material : very interesting (ditto) difference in performance.
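The definitions of recall and precision given in the terminology above can be illustrated for a single request by a minimal sketch. The function names and the document identifiers are hypothetical, chosen only for illustration; the review itself also allows both ratios to be taken over a set of requests, which this sketch does not attempt.

```python
def recall(retrieved, relevant):
    """Relevant documents retrieved / all relevant documents in the collection."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when no documents are relevant")
    return len(set(retrieved) & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Relevant documents retrieved / all documents retrieved."""
    retrieved = set(retrieved)
    if not retrieved:
        raise ValueError("precision is undefined when nothing is retrieved")
    return len(retrieved & set(relevant)) / len(retrieved)


# Hypothetical request: four documents retrieved, five judged relevant,
# two documents (d2 and d4) in common.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7", "d8", "d9"}

print(recall(retrieved, relevant))     # 2 of the 5 relevant documents were retrieved
print(precision(retrieved, relevant))  # 2 of the 4 retrieved documents are relevant
```

Here recall is 2/5 and precision is 2/4: the same two relevant retrieved documents are divided by different totals, which is why the two measures can move independently.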

Data references

A particular convention will be adopted for specifying test collections. Thus an expression like "a 21x379 tropical foodstuffs collection" refers to a set of 21 requests and 379 documents dealing with tropical foodstuffs. These normally have associated relevance judgements, i.e. sets of documents defined as relevant to the requests. Items like "the 42x200 Cranfield collection" refer to well-known test collections. As collection specifications are in the form "mxn", the use of "m" or "n" refers to collections for which full details are not given, "m" representing an unknown number of requests and "n" of documents. The form "nK" is used where the number of documents is not given, but is manifestly large.
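The collection convention just described can be read mechanically, as in the following sketch. The names `CollectionSpec` and `parse_spec` are hypothetical, and it is an assumption of the sketch that "nK" stands in the documents position of an "mxn" expression (for example "42xnK"); the review states only that the form is used when the document count is unknown but large.

```python
from typing import NamedTuple, Optional


class CollectionSpec(NamedTuple):
    requests: Optional[int]   # None when written as "m" (number not given)
    documents: Optional[int]  # None when written as "n" or "nK" (number not given)
    large: bool               # True only for the "nK" form


def parse_spec(spec: str) -> CollectionSpec:
    """Read a 'requests x documents' expression such as '21x379'."""
    left, sep, right = spec.partition("x")
    if not sep:
        raise ValueError(f"expected 'requests x documents', got {spec!r}")
    requests = None if left == "m" else int(left)
    if right == "nK":
        # Assumed reading: document count unknown but manifestly large.
        return CollectionSpec(requests, None, True)
    documents = None if right == "n" else int(right)
    return CollectionSpec(requests, documents, False)


print(parse_spec("21x379"))  # the tropical foodstuffs example
print(parse_spec("mx200"))   # unknown number of requests, 200 documents
```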

Literature references

To avoid overloading the text a slightly abbreviated form of reference has been adopted. "Snooks 1969,1970", for example, refers either to papers by the lone Snooks, or to papers of which the first listed author is Snooks, or to papers of which he is deemed the lead author because he is project director or most consistent author in a series of project papers. The relevant papers are all listed under Snooks in the bibliography. One or two sets of references use project or organisation names as leads.

Literature coverage

Sparck Jones 1973a attempted to cover much of the relevant literature for the period 1965-1970. This survey concentrates on the more recent period 1968-1973. It is based chiefly on papers published in the Journal of Documentation, the Journal of the ASIS, and Information Storage and Retrieval, but chapters like that dealing with question answering use material from other sources. I have also exploited other surveys, notably the relevant chapters of the Annual Review of Information Science and Technology (Cuadra 1966-). General acknowledgements must be made to Cleverdon 1966, Coyaud 1966, Lancaster 1968b, 1972a, Salton 1968a and Sharp 1965.

Organisation of the review

Section I defines automatic indexing for the purposes of the review and provides background on information retrieval and linguistics. Section II lists the linguistic components of an information retrieval system, from the point of view of syntax and semantics, and considers other system components which may be expected to influence indexing performance. Sections III and IV deal with syntax and semantics respectively, describing syntactic and semantic approaches to document analysis, description and searching and to index languages. Section V examines 'indirect' indexing, represented by the use of citations and by document clustering. Section VI considers automatic indexing in operational mechanised retrieval systems, both standard and 'non-standard', the latter covering interactive retrieval and the use of printed indexes. Section VII makes a brief survey of work in the related areas of automatic abstracting and automatic fact retrieval. Section VIII summarises the main evaluation experiments involving automatic indexing, with special reference to the SMART project. Section IX attempts an overall conclusion and recommendations for the future.