X-l X, A Relevance Feedback System Employing a Dynamically Evolving Document Space M. C. Davis, M. D. Linsky, M. V. Zelkowitz Abstract Methods for improving precision and recall in information retrieval have been based mainly on query modification or temporary document transformations. The present study investigates results obtained when, in addition to modifying the query, the document collection is considered as a dynamically changing space which is continually improved to reflect, more accurately, the contents of the documents it contains. This alteration is achieved by reclustering the documents based on relevance feedback, so that future queries can benefit from the results of processing previous queries. 1. Introduction Information retrieval is fundamentally concerned with the selective retrieval of information which is pertinent to an inquiry from a large source of data. A comprehensive manual search covering even a small portion of available information is clearly impossible when dealing with a large library containing several million volumes. Current card catalogue oriented systems have proved to be useful tools towards the realization of more efficient, exhaustive scanning of information files, but intrinsic difficulties resulting from requirements imposed upon such systems by the alphabetical nature of these files either renders them unwieldy or incomplete. Innovations X-2 in the computing field have led to the notion that the retrieval problem can pragmatically be coped with only by using computing devices programmed to simulate personal inspection of possible relevant information. SMART document retrieval programs are designed to accomplish such a simulation. Basic SMART retrieval involves two major procedures. Initially, The syntactic and semantic considerations are employed to automatically construct a concept vector with numerical components which will essentially act as the query itself when future reference is made to that query. When a query is processed, its concept vector is compared with the concept vectors of all of the documents in the collection (these document vectors are derived in a manner analogous to that for the concept vector for queries) and the cosine correlation, a measure of the similarity among queries and documents, is obtained. It is assumed that the probability of relevance of a given document to the query at hand is greatest for those documents whose correlation with this query is highest. Thus, the user will be presented with identification numbers of the documents with the highest correlations. Due to the syntactic and semantic impreciseness of the English language as well as the user's possible uncertainty pertaining to the exact information which he is seeking, standard means of condensing or reducing documents by automatic procedures such as, for example, statistical term associations and frequency counts of particular words and phrases, are not definitive enough to produce a space which is an exact image of the original documents and queries. Thus, it has been hypothesized that systematic X-3 alteration of these indexing products based on user relevance feedback, can be employed to rectify the effects of any misinterpretations of user intent or document emphasis, to produce an "improved" or "refined" document space. For instance, studies have been made to evaluate systems which require that the user return judgments indicating which of the retrieved documents are of value to him. Based on these personal relevance judgments, the system then processes a modified query which reflects the feedback indications. That is, more emphasis is placed on documents which bear a marked similarity to the documents previously found to be relevant. The expected improved results, which have been demonstrated in [1], indicate that more relevant documents can thus be retrieved. There exists a mathematical justification to support the expectation that such query modification will result in more effective retrieval. Since documents and queries are vectors with numerical SMART components, they can be considered as points in a vector space. normally retrieves all documents in the vector space which lie "close" to the query (see Fig. 1(a)). Hopefully, modifications based on relevance feedback can be used to move the query to a new position in the space. Ideally, a greater density of relevant documents will be centered about this portion of the space. Though such a query modification does rectify to some extent the imperfections in the concept vectors corresponding to the user's queries, it has no effect on the document space itself. Any inadequacies in the original document space will exist throughout the life of the collection. It is, therefore, contended that the document space must also be altered if optimal results are to be obtained. 4 a) Typical SMART Retrieval b) Typical SMART Retrieval with Relevance Feedback c) Typical Retrieval from Modified Document Space d) Typical Retrieval from Modified Document Space with Relevance Feedback Retrieval Illustrations Fig. 1 X-5 Implementation of a SMART-like system with the inclusion of the above query alteration technique will be notably unsuccessful in handling situations where relevant documents are clustered about distinct points which are distant from one another as in the illustration of Fig. 1. The query, when altered, will be moved to a position which is close to or within one cluster, but far from the second cluster of relevant documents. totally ignored. This second group of documents will, therefore, be The assumption that distinct documents which are found to be relevant to a given inquiry are in fact interrelated, leads to the contention that a reclustering of the document space based on relevance feedback data, is highly desirable. If this is done, determination of a single relevant document will easily facilitate the recovery of others. Whereas query modification involves a temporary change,since it is unlikely that a given query will exactly duplicate another one submitted at a future date, the proposed document space revisions are permanent in the sense that each updated document space is a refined version of the current space, not the original one determined by SMART. This report is concerned with the reclustering of documents within a dynamic document space and the corresponding effects of these modifications on retrieval results. Control cases are examined to provide a basis for the evaluation of the effectiveness of the proposed method. 2. Proposed Study The concept of a dynamic document space is not in itself novel. The work of Friedman, Maceyak, and Weiss also involved the clustering of relevant documents [1]. However, unlike the presently proposed scheme which X-6 retains the effects of the document space modifications during the processing of future queries, the document space treated in [1] reverts to its original form before the processing of the next query. With the proposed system, the methods which control the successive alterations of the document space are based upon the following assumptions: a) For a given query, concepts which appear more frequently in relevant documents than in nonrelevant documents probably contribute significantly to the relevance of the pertinent documents. The significant concepts are related to one Thus, another and often occur in conjunction with one another. by raising the weights of these concepts in all documents within the entire space which contain occurrences of these concepts, similar documents are brought closer together; b) Any relevant document (as determined by user feedback) which does not contain an instance of a given concept determined to be significant is likely to contain material which nonetheless relates to this concept. that relevant document. Therefore, this concept is added to It is expected that by increasing the weights of these concepts, more relevant documents will be clustered together and ultimately retrieved, when a similar query is processed in the future. It is difficult to determine an adequate criterion for deciding which concepts are, in fact, significant to the relevance of a particular document. A discrimination factor, d . , can be calculated from the quantities . n., where d., r , and n., are defined by equations (1), (2), and (3). I r. and i i i r, a 7 y 1= c u A *keR ****• ' L ; I = no. of elements e R (1). X-7 n. = - ) r i J / J c k,i ; J = no, of elements e N (2) keN d. = tr. - n. )/(r. + n. ) 1 1 1 1 1 (3) c R N . is the weight of concept i in the kth document. is the set of relevant retrieved documents. is the set of nonrelevant retrieved documents• Thus r. is the average weight of concept i in the retrieved relevant K ,1 documents; n. is the average weight of concept i in the retrieved nonThe difference, r. - n, , if positive, is then a measure relevant documents. of how much more important the ith concept is in describing the nature of the relevant documents than in describing the nature of the nonrelevant documents. This measure, when normalized by dividing by the factor, r. + n,, A positive value for d. l becomes the desired discrimination factor, d.. l indicates that the concept occurs more frequently in the retrieved relevant documents than in the retrieved nonrelevant and therefore is of some significance. Clearly, the larger the value of d., the more significant A concept is deemed the concept is as an indicator of document relevance. "significant" if and only if d, > 6 (4) where 6 is an appropriately chosen constant which specifies the minimum value of d., which demonstrates the "importance" of the concept. A reasonable approach to determining the proper magnitudes of the ensuing alterations is to define the increment as a function of d.. All X-8 documents within the entire document space are then modified by the formula: c = c . (1+yd.) for appropriately chosen y. (5) It is evident that if c affect its value. , is originally 0, equation (5) will not K,1 However, consistent with assumption b) above, for documents deemed relevant, the absence of concept i will result in an alteration specified by equation (6) as follows: c, . « e for K e R k,i (6) Since concept weights can never decrease with this scheme, they would grow unmanageably large over a long period of time if no provisions were made to check this growth. Consequently, the documents are all normalized This normalization process serves an to a Euclidean length of 1000. additional purpose. Concepts which are never significant, i.e., have corresponding d.'s which are always negative or negligible positive quantities, are reduced in magnitude due to the increase in the weights of the relevant concepts. In a sense, negative feedback is thereby achieved. The retrieval process can now be specified by the following algorithm: a) Retrieve the top 15 documents (based upon the cosine correlation with the query). b) Obtain relevance feedback judgments from the user concerning these 15 retrieved documents (Our experiments relied on a priori knowledge of the relevant documents to simulate this feedback procedure.) X-9 c) d) e) Compute r. , n., and d., from (1), (2), and (3). Set d. = 0 if d. < 0 l l (7) Process the collection and perform the transformation specified by (5). f) Repeat step a) with the modified collection and the same or different query depending on input specification. 3. Experimental Results The basis experiment consists in applying algorithm (7), programmed on the IBM 360/65, to the Cranfiald collection of 200 documents and 42 queries. In performing the experiments summarized below, two general conditions may obtain during the alteration of the document collection: a) The queries in each group have similar sets of relevant documents, (illustrated in Fig. 4 ) ; b) The queries in each group have different sets of relevant documents, (illustrated in Figs. 2 and 3 ) . In almost all instances, several queries are "batch processed," before the collection is refined. Condition b) is probably more repre- sentative of a real situation since many queries would normally enter a retrieval system before the collection is updated. The recall and precision indicated in Figs. 2 through 5 are typical of the results obtainable with the proposed system. The collection was modified on the basis of prior searches using query 34 for Fig. 2, queries 12, 15, 16, 17, 38, and 41 for Fig. 3^ and queries 7, 15, and 17 X-10 ; Rank i 1 1 2 Document 104 102 R 199 96 18 191 200 99 193 R 109 83 R 98 91 90 64 Correlation .2307 .2091 .1501 .1484 .1451 .1409 .1407 .1394 .1304 .1223 .1169 .1146 .1114 .1102 .0960 Rank 1 2 Document 102 R 104 199 18 191 193 R 96 83 R 200 99 109 64 90 91 98 Correlation .2273 .2262 .1447 .1420 .1409 .1398 .1363 .1349 .1328 .1264 .1192 .1124 .1095 .1091 .1064 3 I 4 5 6 7 8 9 10 11 12 13 14 15 1 34 5 6 7 8 9 10 11 12 13 14 15 1 a) SMART Standard Retrieval b) Iteration Using Query 34 y a o.l, 6 = 0.9, e = 0 Rank Document 102 R 104 193 R 199 18 191 96 83 R 200 99 109 64 91 90 98 Correlation .2290 .2276 .1543 .1444 .1440 .1418 .1359 .1339 .1313 .1282 .1182 .1165 .1091 .1083 .1057 | Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Document 102 R 104 193 R 83 R 191 199 18 96 200 64 99 109 90 91 17 Correlation .2380 .2295 .1455 .1438 .1427 .1420 .1387 .1352 .1287 .1237 .1215 .1173 .1090 .1088 .1076 i 1 1 2 1 1 1 3 4 5 6 7 8 9 10 11 12 13 14 15 c) Iteration Using Queries 34 and 16 Y = 0.1, < = 1.0, e = 50 S d) Iteration Using Query 34 y = 0.2, 6 = 0.5, e = 50 Retrieval Results for Query 16 Fig. 2 X-ll • Precision A A • Curve of Curve of Curve of A Curve of Fig. 2(a) Fig. 2(b) Fig. 2(c) Fig. 2(d) O.I h J_ _L X X i » i i i r Recall 0 .05 .10 .15 .20 .25 .30 .35 .40 .45 .50 Recall-Precision Curve Fig. 2 (contd.) X-12 Rank 1 2 3 4 5 6 7 8 9 10 11 i 12 13 14 15 a) Document 41 R 100 90 R 111 11 45 110 127 104 192 71 159 42 R 76 133 Correlation .4762 .4280 .3859 .3151 .3123 .2896 .2750 .2688 .2637 .2610 .2601 .2576 .2572 .2481 .2480 Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 . 15 • Document 41 R 100 90 R 111 11 45 127 110 104 42 R 159 76 71 133 185 . J Correlation .4784 .4312 .3799 .3177 .3095 .2910 .2737 .2699 .2662 .2620 .2582 .2560 .2506 .2503 .2461 SMART Standard Retrieval b) Space Modification Y = 0.1, 6 - 0.5, e = 50 Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 c) Document 41 R 100 90 R 111 11 45 127 42 R 104 110 76 159 133 185 39 Correlation .4685 .4317 .3615 .3192 .3014 .2901 .2761 .2685 .2672 .2646 .2618 .2580 .2516 .2484 .2453 Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 d) Document 41 R 100 111 90 R 42 R 45 127 76 11 39 104 188 156 185 159 Correlation .4450 .4183 .3152 .3085 .2895 .2731 .2708 .2678 .2643 .2641 .2610 .2492 .2492 .2489 .2478 i Space Modification y = 0.2, 6 = 0.5, e = 50 Space Modification y = °-5/ 5 a °-2> £ = 50 Retrieval Results for Query 7 (Results after iteration on Queries 12, 15, 16, 17, 38 and 41) Fig. 3 X-13 • Precision A • Curve Curve Curve A Curve of of of of Fig. Fig. Fig. Fig. 3(a) 3(b) 3(c) 3(d) A 0.8 0.7 0.6 \ > V*. #, 0.5 0.4 0.3 0.2 0.1 L_ 1 0.1 0.2 L. 1 1 I—— * • .Recall • 0.3 0.4 0.5 0.6 e) Recall-Precision Curve Fig. 3 (contd.) X-14 Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 a) Document 80 81 102 66 82 69 83 88 125 193 114 94 111 124 11 R R R R R R R Correlation .5190 .5021 .4696 .4537 .4218 .4036 .3966 .3831 .3807 .3660 .3578 .3533 .3369 .3341 .3213 Rank 1 2 3 4 5 6 7 8 9 10 Document 80 81 66 102 69 82 125 83 114 94 88 193 111 124 11 R R R R R Correlation .5009 .4889 .4661 .4460 .4192 .4073 .3935 .3828 .3703 .3611 .3589 .3443 .3404 .3361 .3296 1 1 1 1 1 1 1 I u 12 13 14 15 b) R R SMART Standard Retrieval Space Modification Using Queries 7, 15 and 17 Y = 0.1, < » 0.7, e - 50 5 Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Document 80 81 66 102 69 82 125 83 114 94 88 193 111 124 11 R R R R R Correlation .5091 .4964 .4661 .4505 .4193 .4104 .3935 .3883 .3704 .3611 .3606 .3503 .3404 .3362 .3296 R R c) Space Modification Using Queries 7, 15, and 17 y - 0.1, 6 = 0.9, e = 0 Retrieval Results for Query 15 Fig. 4 X-15 Precision A — Curve of Fig. 4(a) • Curve of Fig. 4(b) • Curve of Fig. 4(c) "a-®— a 0.4 0.3 0.2 h 0.1 [- L 0.1 •*-*- Recall 0.2 0.3 0.4 0.5 0.6 d) Recall-Precision Graph Fig. 4 (contd.) X-16 i ' - • • • • • • i Initial Search Figure Queries Relevant _ Documents 10 127 128 129 130 New Search Following Space Modification Relevant Queries Documents 16 67 80 81 82 83 84 85 102 193 41 42 72 90 95 Change in Recall and Precision 2 34 Increase 3 12 15 16 17 38 41 46 48 50 52 54 67 81 83 85 87 90 92 94 102 120 192 194 41 72 81 83 85 87 90 93 95 109 47 49 51 53 55 80 82 84 86 88 91 93 95 109 178 193 195 42 80 82 84 86 88 91 94 102 193 7 Increase 4 7 15 17 15 80 81 82 83 84 85 86 87 88 102 109 193 Decrease 1 1 Summary of Results of Space Modification Fig. 5 X-17 for Fig, 4. Following these space modifications, searches were performed using queries 16, 7, and 15 respectively. One result is immediately obvious. Whenever the collection is modified using non-related queries (queries with different sets of relevant documents from those of the query now at hand), recall and precision increase. However, whenever the collection is first modified using queries with relevant document sets related to those of the query being examined, recall and precision decrease. The latter result may be explained by the fact that although the clustering of relevant documents is accomplished as desired, the centroid of the generated cluster moves away from the query (as in the example of Fig. 4 ) . Specifically, when a space modification is performed, there exists no a priori reason for expecting that the new document space should be such that the relevant documents are clustered around the present query, thus yielding improved precision and recall, since the scheme which controls the document space alteration is independent of the query. However, the relevant document cluster must be located at points in the space which are close to the query being processed if increased precision and recall are to be achieved. The addition of query modification to the system is therefore necessary to assure this closeness; the query will be moved towards the relevant document cluster, thus facilitating improved retrieval results. In order to verify the fact that the relevant documents are grouped together,by the space modification process, the document-document correlations of the original space and of the modified space are computed X-18 as in the example of Figs. 6 and 7. These results confirm the fact that the related documents are indeed grouped more closely together than they were originally. For example, in Fig. 6, Query 16 retrieves only relevant documents 83, 102, and 193. However, the intercorrelations among documents 80, 81, 83, 84, and 88 all increased markedly. While it is true that documents 67, 85, and 102 are essentially unaffected by the modification process, 5 relevant non-retrieved documents were clustered as desired. Fig. 7 offers another example of similarly successful clustering. An additional experiment was conducted to demonstrate that the proposed system, when expanded to include query modification in addition to document space alteration, leads to the desired increase in precision and recall. Specifically, given the modified document space, relevance feedback results are used to modify the query in a fashion similar to that used by [2]. An updated query, Q 1 , is determined from an original query, Q, using the following equation: £R. 2' - Q + ISRJ (8) where the R. are the relevant documents. i The denominator is used to normalize the changes in the modification procedure so that the query is not altered too radically. If this normali- zation were not carried through, the incremented concepts would nullify the effects of any of the components not affected by the modification procedure. Fig. 8 demonstrates the above assertion. Fig. 8(a) represents the original SMART retrieval with the original query; Fig. 8(b) represents retrieval results using the original query and the modified collection. The X-19 m as O CM CM 00 VO CM r- vo as r- ^ oo LO LO vO CM VO r- o LO1 r^ VO CM 00 ' F s oo rH VO LO LO ^r oo rH CM o o ^r CM O rH CM CM CM CM 00 00 ^t1 ^ CM CM CM CM r^ LO 00 00 * ^ as LO o CO LO ^r o rH VO o r^ o oo CM ^rjvo CM rH oo vo rH rH rH O M C M CM|vO VO|00 CM CM CM CM 00 00 00 CM CM CM o ^ o LO as V0 rH V0 00 rH as o 00 O 00 00 ^r oo 00 00 00 00 CM o CM CM r- CM CM LO 00 O VO CM LO rH oo CM as r^ as o 00 VO O rH oo oo as r^ rH 00 V0 CO KD ^r O CM VO LO CM ^ oo as o rVO LO 00 00 ^r LO CM CM 00 00 00 CM 00 CM vO oo oo rH V0 VO CM vO O CO O0 00 vO VO LO oo r^ r> vo O CM ^ o V0 OS O oo 00 00 VO OS rH ^T CM oo oo ^r 00 00 CM ^F LO VO LO VO ^r LO rH CM VO CO 00 LO ^r vo ^r CM OS CO rH vO O CTi CM LO CO CO CO ^r CM Os OS rH VO LO CM CM CM 00 00 00 CM CM VO LO ^d1 LO CM rH oo ^ LO * * * 0) U 0) 3 OJ 0 -P -P J^ Q «^ * CO O ^r oo rrj d) rH 0) > VO CO CM LO O r- CM vo VOIVO r- ^r CM CM 00 00 r^ VO \> * CM 00 00 rH to O rH OO O OS CT> CM CM oo r^ r- o CM CM 00 00 CM 00 K & 0) G 03 CL, 04 W 0 4J TJ 0) - -H rH IW -H II T 3 O ^O g rH • II JH § a) 3 o — £ 0 -H -P rrj rH C D +J aj a £ p LO 00 CM [^ CM OS 00 00 LO LO ^r 00 CM 00 CM VO LO o o LO LO CM CM CM CM LO O r- co ^ as ^r oo o U U O 0 Q o o a) u tQ II § C >P C -H vO < D CO ^ 00 Os KD r- CM O LO CM 00 ^r as V0 CM CM ^r r- oo 00 00 LO LO o o ^r LO OS O0 OS ^ rH 00 rH 00 CM ^r LO LO LO LO * U O 4-1 Ul C -H -4J o -P 0 •H -P ^ -P 0 VOJ3 rH vd g rrj C D U O C -H a) tP -H Pn m 00 CO CM rrj rH vO vO (D M rH > i