VIII. Bibliographic Data as an Aid to Document Retrieval

J. W. McNeill and C. S. Wetherell

Abstract

The hypothesis of this project is that bibliographic data added to a SMART style document collection will improve retrieval effectiveness. Two uncommon kinds of bibliographic data, authors and place of publication, are used to build concept matrices. These matrices are used with associated queries in a typical retrieval environment involving relevance feedback. With the aid of a new statistic, it is found that these matrices actually do aid retrieval.

1. Introduction

Intuitively, bibliographic information is one of the most valuable tools available for a search of technical reference material. This fact is recognized by several authors. Salton [10,11] describes a general method for incorporation of bibliographic data into a SMART style retrieval system. Garfield [2] writes of the importance of citation indices for literature searches in the sciences. His work has had some practical effect, as the Science Citation Index published by the organization he heads is now a standard library item after only six years of publication. Recently, Garfield has announced automated search techniques of current literature, available for a modest fee. These too are based on the citation index. Finally, Kessler [4,5,6,7,8,9] discusses the use of bibliographic coupling as a retrieval method. He also notes the importance of the journal of publication as a clue to the content of technical documents.

In report ISR-12, Amreich, Grissom, Michelson, and Ide [1] follow Salton's suggestion and attach a citation index to the concept matrix of the ADI document collection. They showed that retrieval is about as efficient using the index alone as it is with the original matrix, and that combining the two types of terms results in a significant improvement in retrieval effectiveness.
This result is perhaps not typical since the ADI collection, although small, is heavy in cross-reference.

The present project was originally expected to check the results of Amreich, et al., with another less heavily cross-referenced document collection and then to extend the work toward that of Kessler. However, the collection chosen has so few cross-references that its citation index is, for practical purposes, null. This fact made it advisable to use other easily available bibliographic data.

Several other thoughts reinforced this decision. First, although bibliographic data in all its forms have long been recognized as a tool for document retrieval, only citation indices and their derivatives, coupling indices, have actually been tested for use in a mechanized system. Two other major sources of information, author and place of publication, have been neglected in the literature.

Second, bibliographic information has a double practical importance. It is easy to obtain such information through automated methods. Also, bibliographic data, even more than keywords or subject indices, reflect the position which the author feels the document holds in the present literature. Kessler discusses this point and his examples make it very clear that bibliographic data can be used to chart the mainstreams of physics. Although subject indices certainly have their uses, by their very construction, they cannot portray this implicit but usually accurate evaluation of a paper's standing in the literature.

The hypothesis upon which the project is based is the following: Bibliographic information other than that embodied in the citation index will, when added to a conventional concept matrix, improve retrieval effectiveness. This improvement may be demonstrated through the use of a statistic developed in this paper. The project is a test of this hypothesis.

2. The Experiment

The document collection used to test the experimental hypothesis is the small MEDLARS collection.
This is a set of 273 documents concerning various aspects of medicine and 18 associated queries. The concept matrix which describes the collection is a word-form matrix consisting of approximately 500 concepts and a basic concept weight of 12. The practical weight range is 12 to 60.

The 18 queries each consist of two parts. The first is a query concept vector which is constructed exactly like a document vector, except that it is usually much shorter. In addition, each query has associated with it a set of relevant documents. These relevance judgments are used to calculate the efficiency of a retrieval method applied to the collection.

The basic direction of this project is the compilation of six new concept matrices describing the collection. These matrices are numbered 1 through 6. The number 0 matrix is the original word-form matrix. Along with the new matrices, new query sets are constructed. These query sets operate with the same sets of relevance judgments as the originals, but have concept vectors drawn from their respective concept matrices. The bibliographic data upon which the matrices are based are the following:

   Matrix   Data
   1        Author of document
   2        Author of citing document
   3        Author of original or citing document
   4        Original place of publication
   5        Place of publication of citing document
   6        Place of publication of original or citing document

For matrix 1, the authors of all the documents in the collection are listed. Authors of two or more documents are given concept numbers. Concept vectors are generated for documents in the standard way, with each concept having a weight of 12. For matrix 2, a similar procedure is followed using a list of all the authors who cite a document in the collection.
However, authors may cite, in different papers, a given document more than once, so that while the basic weight remains 12, the actual weight for a concept reflects the number of times the document in question has been cited by the author associated with the concept.

In matrix 3, the above two lists are combined. Every author who is associated with two separate documents is assigned a concept. The basic weight for a citation author is 6 and for an original author 12. The weight for a concept is computed by summing basic weights for all of the author's contacts with the document in question. Matrix 3 is larger than a simple union of matrices 1 and 2. If author A wrote only document 23 and referenced only document 211, he will not appear in either matrix 1 or 2, but he will appear as a concept in matrix 3. This does, in fact, add a great deal of information to matrices 3 and 6. Matrices 4, 5, and 6 are constructed in the same manner from place of publication information.

Query sets are needed to operate with these matrices. Each set contains 18 queries, corresponding to the original 18, and each of the 18 has the same set of associated relevant documents as the corresponding original. Construction of query i for matrix j proceeds as follows:

   1. The union is taken of the document vectors in relevance set i included in matrix j to form a list of concepts relevant to query i.
   2. If the list has fewer than four elements, enough random concepts are added to make four.
   3. If the list has four or more elements, concepts are deleted using a random binary distribution (deleting concepts corresponding to the zeros of the distribution). At least four of the original list are left in each reduced list. Finally, two concepts are added at random.
   4. If the list is null, the query is assumed to be null.
   5. Weights for all concepts are set to twelve.
   6. The random concepts are chosen from the matrix j.

The rationale for this scheme is simple.
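Before that rationale is spelled out, the six construction steps can be sketched in Python. This is a minimal sketch under stated assumptions, not the program used in the experiment. The names `build_query`, `relevant_doc_vectors`, and `matrix_concepts` are hypothetical, and vectors are represented as dictionaries from concept number to weight. Two details that the steps leave open are filled in by assumption: the added random concepts are drawn from concepts not already in the list, and the rule that at least four original concepts survive is enforced by restoring deleted concepts at random.

```python
import random


def build_query(relevant_doc_vectors, matrix_concepts, rng,
                min_keep=4, noise=2, weight=12):
    """Sketch of steps 1-6 for one query i against one matrix j.

    relevant_doc_vectors: list of {concept: weight} dicts (relevance set i).
    matrix_concepts: list of all concept numbers in matrix j.
    rng: a random.Random instance.
    """
    # Step 1: union of the concepts found in the relevant document vectors.
    concepts = sorted(set().union(*[set(v) for v in relevant_doc_vectors]))

    # Step 4: a null list gives a null query.
    if not concepts:
        return {}

    def draw(k, exclude):
        # Step 6: random concepts come from matrix j.
        # Assumption: they are taken from concepts not already listed.
        pool = [c for c in matrix_concepts if c not in exclude]
        return rng.sample(pool, min(k, len(pool)))

    if len(concepts) < min_keep:
        # Step 2: pad short lists with random concepts up to four.
        chosen = concepts + draw(min_keep - len(concepts), set(concepts))
    else:
        # Step 3: a random 0/1 value per concept; zeros mark deletions.
        mask = [rng.randint(0, 1) for _ in concepts]
        kept = [c for c, bit in zip(concepts, mask) if bit]
        dropped = [c for c, bit in zip(concepts, mask) if not bit]
        # Assumption: if fewer than four survive, deleted concepts are
        # restored at random until four of the original list remain.
        rng.shuffle(dropped)
        while len(kept) < min_keep:
            kept.append(dropped.pop())
        # Finally, two concepts are added at random.
        chosen = kept + draw(noise, set(concepts))

    # Step 5: every query concept receives the basic weight of twelve.
    return {c: weight for c in chosen}
```

A call such as `build_query([{6102: 12, 6119: 12, 6123: 24}], list(range(6100, 6130)), random.Random(0))` returns a four-concept query containing the three document concepts plus one random concept, all with weight 12. The reasoning behind the scheme itself follows.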
It is assumed that a query will contain some concepts which are relevant to the documents which it is to select from the collection, and that the query will also contain some concepts which are not relevant (i.e., are noise concepts). Authors of queries are likely to know some of the authors (or publications or subjects) in the field they are investigating and they are also likely to make some mistakes in their query construction. It is also assumed that a difference in weights will not alter the query operation after four iterations, so that it is safe to set all query weights to 12.

One point should be emphasized concerning the construction of these matrices. An author or place of publication which has relevance to only one document has not been assigned a concept. Thus, retrieval for many queries of the form "I only know it was written by J. J. Smith" or "It appeared in the Czech Journal of Sedimentology" will not operate successfully. In this small data base, several queries for the constructed matrices vanish because of this requirement. However, in a large data base, this problem is less acute, since there are then fewer authors of only one document who are not referenced or do not reference, and few journals with only one relevant document. On the other hand, the same reasoning implies that adding concepts for single authors or single documents would probably not be very expensive.

The experiment is conducted in a "typical retrieval environment". Unfortunately, there is no experience to indicate what such an environment is. The assumption is made that this environment would include a standard SMART retrieval system, utilize the cosine measure of document similarity, and use some type of relevance feedback. The relevance feedback equation used is

$$q_{i+1} = q_i + r_1 + r_2 - n_1$$

where $q_i$ is the old query, $q_{i+1}$ is the new query, $r_1$ and $r_2$ are the first two relevant documents, and $n_1$ is the first nonrelevant document retrieved. Only documents in the top five retrieved are used in this equation. Each query is iterated four times (the original query constitutes the first iteration).

The matrices and query sets have been designed so that they may be combined by concatenation of corresponding document and query vectors. The experiment consists of running various combinations of matrices and query sets against one another and measuring the retrieval effectiveness of each pair. The original word-form matrix and query set is used as a control group. A matrix-query set pair will improve retrieval only if it does better than the control group. If none of the matrices which include bibliographic data do better than the control group, the hypothesis will have been shown to be false.

The matrix-query set combinations run are

   Matrix i vs query set i, i = 1,...,6.
   Matrix 0i vs query sets 0, i, 0i, i = 1,...,6.
   Matrix 36 vs query sets 3, 6, 36.
   Matrix 036 vs query sets 0, 03, 06, 36, 036.

Concatenated matrices and query sets are denoted by listing their elements in order. Thus, matrix 02 is the matrix made by combining matrix 0 and matrix 2.

The nature of the document vectors, and of the query vectors after several iterations, is illustrated with the following example (concept numbers are shown followed by weights):

   Document 268 (data set 2)
      6102 12    6119 12    6123 24

   Query 8 for data set 2 (Iteration 1)
      6102 12    6110 12    6119 12    6129 12

   Query 8 for data set 2 (Iteration 4)
      6102 36    6110 12    6116 12    6119 60    6125 24    6129 24

3. The Statistical Measure

The hypothesis proposed can be considered validated only if it can be shown that the expanded data bases actually produce better retrieval than the original data base. To this end some measure of retrieval effectiveness is needed. The measure chosen is a sign test based on rank recall.

The rank recall for a query is calculated by the formula

$$rr = \frac{\sum_{i=1}^{n} i}{\sum_{i=1}^{n} r_i}$$

where $n$ is the number of relevant documents for the query in question, and $r_i$ is the retrieval rank of the i-th relevant document. The measure varies from 1 for the best possible retrieval to 0 for the worst.

The sign test is calculated in the following manner. Rank recall is calculated for each query of the control group ($rr_c$) and of the test group ($rr_t$). The sign value is

$$S = \sum_{j=1}^{18} \operatorname{sgn}\left(rr_{t,j} - rr_{c,j}\right)$$

where $rr_{t,j}$ and $rr_{c,j}$ are the test and control rank recall values for query $j$, and sgn is the signum function. The difference is calculated to within a standard error of 0.005 on either side. If S is non-negative, retrieval is at least as effective for the test group as for the control group. If the absolute value of S is greater than or equal to 6, there is almost certainly a significant difference between the test group and the control group. It is reasonable to assume that if the absolute value of S is greater than or equal to 3, there is probably some difference between the groups. The direction of the difference depends on the sign of S.

A second statistic of interest is obtained when matrix i is run against query set i. In this case, retrieval is not good enough to bring S above -10 or so. This is so because the bibliographic data matrices all have at least 60 null document vectors, and do not contain enough information to compete with a matrix of a size equalling that of the basic MEDLARS matrix. However, the question arises whether these query set-matrix pairs do better than random queries might. The retrieval method used only does better than random if the rank recall for a query is larger than

$$d_j = \frac{\sum_{i=1}^{n} i}{\sum_{i=1}^{n} N_i}$$

where $n$ is the number of relevant documents for the query and $N_i$ is the document identification number of the i-th relevant document. Using the $d_j$ in place of the $rr_c$, the statistic D is calculated in exactly the same manner as S. Again, D must be non-negative if these queries are to be judged as performing better than random more than half the time.
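The statistics of this section can be sketched in Python as follows. This is a minimal illustration rather than the program used in the experiment. The function names are hypothetical, each query is assumed to have at least one relevant document, and the 0.005 figure is interpreted, as an assumption, as a band within which a difference in rank recall counts as a tie and contributes zero to the sum.

```python
def rank_recall(relevant_ranks):
    """rr = (1 + 2 + ... + n) / (sum of the ranks of the n relevant documents).

    relevant_ranks: retrieval ranks (1 = first retrieved) of the relevant
    documents for one query. Assumes the list is not empty.
    """
    n = len(relevant_ranks)
    return (n * (n + 1) / 2) / sum(relevant_ranks)


def random_baseline(relevant_doc_ids):
    """d_j: the same ratio with document identification numbers used as ranks."""
    return rank_recall(relevant_doc_ids)


def sgn(x, tolerance=0.005):
    """Sign of x; values within the tolerance band are treated as ties.

    Assumption: this is how the 0.005 allowance on either side is applied.
    """
    if abs(x) < tolerance:
        return 0
    return 1 if x > 0 else -1


def sign_statistic(test_values, reference_values, tolerance=0.005):
    """S when reference_values are control rank recalls, D when they are d_j."""
    return sum(sgn(t - r, tolerance)
               for t, r in zip(test_values, reference_values))


if __name__ == "__main__":
    # Three made-up queries, purely illustrative (not data from the experiment).
    test_rr = [rank_recall([1, 2, 5]), rank_recall([3, 4]), rank_recall([1])]
    control_rr = [rank_recall([2, 3, 9]), rank_recall([3, 4]), rank_recall([2])]
    print(sign_statistic(test_rr, control_rr))
```

Passing the control rank recalls as `reference_values` gives S; passing the values from `random_baseline` gives D, so both statistics share one routine, in keeping with the statement that D is calculated in exactly the same manner as S.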
To facilitate a graphic illustration of the most significant results, average recall precision curves are presented. The average value of recall and precision is computed over all 18 queries for 14 different cutoff levels using the following formulas:

$$\text{Recall} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of relevant documents}}$$

$$\text{Precision} = \frac{\text{Number of relevant documents retrieved}}{\text{Number of documents retrieved}}$$

The cutoff values are chosen to provide results over the entire range of possible recall precision pair values.

4. The Results

The results of the experiment can be summarized using the S and D statistics. Table 1 shows the results of running query set i against matrix i, i = 1,...,6, and of running query sets 3, 6, and 36 against matrix 36.

   Matrix vs Query Set      S      D
   1  vs 1                -18     -2
   2  vs 2                -17      6
   3  vs 3                -14     11
   4  vs 4                -14     15
   5  vs 5                -18     14
   6  vs 6                -11     16
   36 vs 3                -17     10
   36 vs 6                -12     16
   36 vs 36                -9     16

   Statistics for Bibliographic Matrices Alone
   Table 1

In Table 2, the results of running matrix 0i against query sets 0, i, and 0i are presented.

   i     S for 0     S for i     S for 0i
   1        0          -10       Not available
   2       -1           -8         -3
   3        0           -3          1
   4        0            1          5
   5       -2           -6         -1
   6        1           -9          4

   Statistics for Combined Bibliographic and Original Matrices
   Table 2

In Table 3, the results of running various query sets against matrix 036 are presented.

   Query Set     0     03     06     36     036
   S             2      4      2     -4       5

   Statistics for Combined Matrix 036
   Table 3

Recall precision curves are presented in Fig. 1 for the following three sets of results:

   1. Original queries against original matrix
   2. 036 queries against 036 matrix
   3. 36 queries against 36 matrix

5. Conclusions

The most important conclusion to be drawn from this experiment is that the hypothesis has been confirmed. A number of entries in Table 2 show that the addition of bibliographic data mildly improves retrieval. The last entry in Table 3 shows that using full queries against a full matrix of original and bibliographic data improves retrieval effectiveness significantly.
This conclusion may be reached on the basis of either the sign test or the recall precision curve.

Using only the 36 data base, document retrieval is quite effective considering that only 230 concepts are included in this matrix. This amounts to 1/25 of the number in the original matrix. Furthermore, these concepts are scattered fairly sparsely through the documents, with about 60 documents having no bibliographic data attached to them at all. Also, the data chosen were not regarded, in advance, as having much value as a retrieval tool. This suggests that the addition of citation index data would strongly improve retrieval once more.

[Fig. 1: Recall Precision Curves for 0, 36, and 036 Data Base and Query Sets. Horizontal axis: recall. Curves shown: original queries against original data base; 036 queries against 036 data base; 36 queries against 36 data base.]

If the document collection is conceived of as a growing structure, new entries will tend to cause bibliographic concepts to be added to the concept matrix. Further, the number of concepts added will probably be linear with regard to the number of documents added. This means that the concept matrix will grow unboundedly as the document collection grows. In a practical system, this may not be allowable. If so, techniques to discriminate between useful concepts and concepts which do not carry their weight will have to be developed.

References

[1] Amreich, M., Grissom, G., Michelson, D., and Ide, E., "An Experiment in the Use of Bibliographic Data as a Source of Relevance Feedback in Information Retrieval", Information Storage and Retrieval, Report ISR-12 to the National Science Foundation, Section XI, Department of Computer Science, Cornell University, Ithaca, New York, June 1967.

[2] Garfield, E., "Citation Indexes for Science", Science, 122, 3159, July 15, 1955, pp. 108-111.

[3] Ide, E., "User Interaction with an Automated Information Retrieval System", Information Storage and Retrieval, Report ISR-12 to the National Science Foundation, Section VIII, Department of Computer Science, Cornell University, Ithaca, New York, June 1967.

[4] Kessler, M. M., "An Experimental Study of Bibliographic Coupling Between Technical Papers", M.I.T., June, 1962.

[5] Kessler, M. M., "Bibliographic Coupling Between Scientific Papers", M.I.T., July, 1962.

[6] Kessler, M. M., "Analysis of Bibliographic Sources in a Group of Physics-Related Journals", M.I.T., August, 1962.

[7] Kessler, M. M., "Bibliographic Coupling Extended in Time: Ten Case Histories", Information Storage and Retrieval, 1, 1963, p. 169.

[8] Kessler, M. M. and Heart, F. E., "Analysis of Bibliographic Sources in the Physical Review (Vol. 77, 1950 to Vol. 112, 1958)", M.I.T., July, 1962.

[9] Kessler, M. M. and Heart, F. E., "Concerning the Probability that a Given Paper Will Be Cited", M.I.T., November, 1962.

[10] Salton, G., "Some Experiments in the Generation of Word and Document Associations", Proc. AFIPS FJCC, Spartan Books, Philadelphia, 1962.

[11] Salton, G., "Associative Document Retrieval Techniques Using Bibliographic Information", JACM 10, 4, October, 1963, pp. 440-457.