VIII-1

VIII. New Experiments in Relevance Feedback

E. Ide

Abstract

New results are given for interactive, user-controlled retrieval strategies using relevance feedback. Search strategies are included based on the identification by the user of a fixed or variable number of relevant or nonrelevant documents. The evaluation results are based on experiments using 200 documents and 42 queries in the field of aerodynamics.

1. The Relevance Feedback Procedure

Automated information retrieval systems, like most mechanical processes, suffer from unavoidable inflexibility. The needs of users of a large information collection, especially a document collection, are too varied to be satisfied by any one fully automatic retrieval algorithm. Users whose needs best match the assumptions built into the system are satisfied; others are not. One suggested way to overcome this limitation is to employ feedback information from the user during the retrieval process. In a document retrieval situation, this could be accomplished as follows:

a) The user poses a request to the retrieval system.

b) The retrieval system returns some information (perhaps abstracts) about a specified number of documents judged relevant to the user's request.

c) The user selects from this set of initially retrieved items those documents which he deems relevant to the request, and feeds this information to the retrieval system.

d) Another retrieval search is performed incorporating these user judgments.

Steps c) and d) are iterated as often as desired. Such an interactive process was proposed by Rocchio, who called it "relevance feedback" [1,2,3]. He showed that in a document retrieval system based on classification and using the cosine correlation function, the theoretically optimum query for retrieving a set of documents R = {r_i} is given by the formula:

    Q_opt = (1/n_r) \sum_{i=1}^{n_r} r_i/|r_i|  -  (1/n_s) \sum_{i=1}^{n_s} s_i/|s_i|
where:

    R = {r_i} = the descriptor vectors of all documents in the collection which are relevant, according to the user, to the request;
    S = {s_i} = the descriptor vectors of all other documents in the collection, i.e. of all nonrelevant documents;
    n_r = number of relevant documents in the collection;
    n_s = number of nonrelevant documents;
    |r_i|, |s_i| = lengths of the document descriptor vectors r_i, s_i.

Of course, the sets R and S are not known to the system. On each iteration, however, the user feedback supplies information about two subsets, R' ⊆ R and S' ⊆ S, where R' = {r'_i} is the subset of relevant documents retrieved and S' = {s'_i} is the subset of nonrelevant documents retrieved. Therefore, the following formula is used by Rocchio to construct a new query from the query of the previous iteration:

    Q_{i+1} = Q_i + (1/n'_r) \sum_{i=1}^{n'_r} r'_i/|r'_i|  -  (1/n'_s) \sum_{i=1}^{n'_s} s'_i/|s'_i|    (A)

where n'_r (n'_s) is the number of relevant (nonrelevant) documents retrieved for feedback in the previous iteration. Rocchio investigated relevance feedback using the above formula and 17 queries in the field of computer science, and found that his algorithm does improve retrieval results [1,2,3].

Another investigation of a relevance feedback system was based on the "ADI collection", a collection of 82 documents presented at a conference on documentation. Thirty-five queries were constructed for this collection, and the documents considered relevant to those requests were specified by the two originators of the queries. The investigation of relevance feedback in the ADI collection was conducted by Riddle, Horwitz, and Dietz [4]. They used 22 of the 35 queries and studied a slightly different algorithm, using the following method of query modification:

    Q_{i+1} = Q_i + α \sum_{i=1}^{n'_r} r'_i

2.
The Experimental Environment

The document collection used in this study (the "Cranfield" collection) contains 200 documents from the field of aerodynamics, chosen from a library of 1400 documents. For this collection, there exist at present 42 queries, constructed by some of the authors of the 1400 documents; these requestors are also responsible for the relevance judgments.

The concept vectors describing documents and queries are quite sparse for this collection. The maximum number of concepts used to describe one document is 85, out of a possible 552 concepts. The largest weight given to any concept in any document descriptor is 288. The query description vectors are sparser by one order of magnitude and shorter than the document descriptors. The maximum number of concepts used in a single query vector is 13; the largest weight in any query vector is 24. The largest number of documents relevant to a single query is 12, or six percent of the collection.

The comparative brevity of the query vectors in this collection is typical for technical document retrieval, and provides a strong argument for the use of relevance feedback techniques. With relevance feedback, the user, in effect, provides a much more detailed query merely by citing a document; the document descriptor itself is used as the query.

The relevance feedback system being studied uses the following query-update formula:

    Q_{i+1} = π Q_i + ω Q_0 + α \sum_{i=1}^{min(n_a, n'_r)} r'_i + γ \sum_{i=1}^{min(n_b, n'_s)} s'_i    (B)

where n'_r + n'_s (see equation (A), section 1) equals N, the number of documents retrieved for feedback. The experimental variables are α, ω, π, γ, n_a, n_b, and N.

The parameter α is positive, and weights all incoming relevant documents relative to the other contributors to the query (previous query, initial query, nonrelevant documents). The parameter π permits the previous query to be increased in weight relative to the incoming documents. Q_0 is the
initial query, as opposed to the query of the previous iteration; ω permits the initial query to be used as part of the new query (see section 3B). The parameter γ should theoretically be negative, as it permits some significance to be attached to the nonrelevant documents retrieved. The parameter n_a (n_b) permits some specific number of relevant (nonrelevant) documents to be used in the query even if n'_r (n'_s) is larger. It is assumed that the r'_i and s'_i are indexed in order of decreasing relevance (as determined by the system) to the query; that is, the n_a relevant documents (or n_b nonrelevant documents) used in the new query will be those closest in the descriptor space to the previous query.

The flexibility of this formula permits the investigation of several feedback strategies. The system also provides the following formula to simulate Rocchio's algorithm:

    Q_{i+1} = π Q_i + ω Q_0 + (1/n'_r) \sum_{i=1}^{min(n'_r, n_a)} r'_i - (1/n'_s) \sum_{i=1}^{min(n'_s, n_b)} s'_i    (C)

Formula (C) does not, however, normalize the vector lengths as does Rocchio's algorithm.

The document and query description vectors for both collections were constructed by applying a SMART thesaurus dictionary to the document abstracts and the queries [3]. The cosine correlation function is used to determine the order of retrieval.

3. Earlier Results in the Same Environment

An earlier study [5] uses the Cranfield 200 document collection and the relevance feedback system described in section 2. Three major variations in relevance feedback strategy are investigated:

1) The parameters π, ω, and α in formula (B) of section 2 are varied, holding γ equal to 0. This strategy is similar to the type investigated by Riddle, Horwitz, and Dietz [4]. The variation in results obtained when π, ω, and α are changed is slight; in fact, less than the variation found by Riddle, Horwitz, and Dietz, who used a different document collection.
2) The number of documents retrieved for feedback (N) is varied; N is set to 5, 10, and 15 documents. The improvement obtained by feeding back 10 documents instead of 5 is impressive; the further improvement obtained with 15 documents is less so. In addition, a "variable feedback" strategy is investigated, in which documents are retrieved until one relevant document is found, or until 15 documents have been retrieved. This strategy provides greatly improved precision at low recall, but only slightly better precision at high recall, than does the N = 5 strategy.

3) Two strategies using the information in retrieved nonrelevant documents are studied. In formula (B) (section 2), γ is set to minus 1 and n_b is set to 1 and to 2. That is, the first nonrelevant document (the first two nonrelevant documents in the second strategy) is subtracted from the query. The strategy in which n_b = 1 (called "Dec Hi"), with N equal to 5, produces a performance comparable to that of a strategy using only relevant documents with N equal to 10.

The following conclusions may be reached from the results of the earlier study: The investigation supports relevance feedback as an information retrieval strategy. It also shows that varying the parameters in a query-update formula which uses relevant documents only is not a promising way to produce significant improvement in performance. The most promising strategies investigated, variable feedback and nonrelevant document feedback, require further study before they can be firmly recommended. The variable feedback strategy and the suggested combination of fixed and variable feedback should be investigated in a suitable evaluation system. The nonrelevant feedback strategies should be studied in a system which permits normalization as in Rocchio [1]. Eventually, some combination of fixed and variable feedback may prove optimal in similar information retrieval environments.

4.
Evaluation of Retrieval Performance

A) The "Feedback Effect" in Evaluation

The investigation described in section 3 uses a retrieval and evaluation method that has been assessed by Hall and Weiderman [6]. After each iteration, all documents in the collection are ranked and the top-ranked N documents are used for feedback. Hall and Weiderman point out that evaluation of this retrieval technique takes into account two effects, which they call the "ranking effect" and the "feedback effect".

Relevance feedback in effect uses information from one or more document descriptors to modify the query descriptor. The relevant documents used for this purpose will be ranked higher by the modified query than previously, and the nonrelevant documents used will be ranked lower. The effect of these rank changes in "retrieved" documents is termed the "ranking effect". If the ranking effect is included in an overall performance measure, the measured change in performance between feedback iterations is quite impressive, as is seen in the results included in the earlier report described in section 3 [5]. This large change in "total performance" (including both ranking and feedback effect) indicates the extent to which the initial query has been perturbed toward the centroid of the relevant documents, and strongly supports Rocchio's theory.

Hall and Weiderman state that in an environment where the user must actively supply relevance judgments for feedback, changes in the ranks of documents which the user has already seen are of no interest to him. The user in such an environment is concerned primarily with the "feedback effect"; that is, the effectiveness of the modified query in bringing new relevant documents to his attention. They conclude that, though total performance is a valid measure of the effectiveness of relevance feedback in approaching the "ideal query", the feedback effect should be isolated and examined as well.
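One way to isolate the feedback effect (the method adopted below) is to hold fixed the positions of documents the user has already seen, and to re-rank only the unseen remainder of the collection with the modified query. The following is a minimal sketch in modern Python notation; the cosine function, the list-of-weights document representation, and the function names are illustrative assumptions, not the original SMART code.

```python
# Illustrative sketch: freeze documents already shown to the user and
# rank only the unseen remainder of the collection by the new query.
def cosine(q, d):
    # Cosine correlation between two weight vectors of equal length.
    dot = sum(a * b for a, b in zip(q, d))
    nq = sum(a * a for a in q) ** 0.5
    nd = sum(b * b for b in d) ** 0.5
    return dot / (nq * nd) if nq and nd else 0.0

def frozen_ranking(query, docs, frozen_ids):
    """Return document ids: the frozen ids first, in the order they were
    shown to the user, then all unseen documents ordered by decreasing
    cosine correlation with the modified query."""
    unseen = [i for i in range(len(docs)) if i not in frozen_ids]
    unseen.sort(key=lambda i: cosine(query, docs[i]), reverse=True)
    return list(frozen_ids) + unseen
```

Under this scheme, a newly retrieved document can never rank above a previously retrieved one, which is exactly the property discussed in the evaluation below.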
The present study evaluates feedback performance in the manner suggested by Hall and Weiderman, by discarding the ranking effect and preserving only the feedback effect. The ranks of the top N documents retrieved in each iteration (the documents used for feedback) are "frozen" in all subsequent iterations, and only the remainder of the collection is searched using the modified query. Thus, in the present investigation, the N documents retrieved on any iteration are guaranteed to be N new documents; that is, documents not used for feedback on any previous iteration. Moreover, the performance measures for the first (second, third) iteration are calculated from a ranked document list in which the top N (2N, 3N) documents are the same as those retrieved previously. Only the changes in the ranks of documents not yet seen by the user are measured.

The evaluation described gives overall results that are deceptively low. Because the top ranks are frozen, no newly retrieved document can achieve a rank higher than that of any previously retrieved document. With a constant feedback strategy, therefore, on the first (second, third) iteration, the highest possible rank for a new document is N+1 (2N+1, 3N+1). For this reason, the feedback effect evaluation is a misleading measure of the performance of the retrieval system, and should be used in conjunction with other evaluation methods. Isolation of the feedback effect is primarily useful for comparing different feedback strategies from the viewpoint of a user in an interactive retrieval environment.

B) Performance Measures

Several average measures of the performance of the tested retrieval algorithms on the 42 Cranfield queries are used in this report. Each measure is based on the concepts of "recall" and "precision". In evaluating an information retrieval system, an arbitrary cut-off is often employed, and documents above this cut-off are termed "retrieved".
With such a cut-off, "recall" is the percentage of documents relevant to the user that are retrieved, and "precision" is the percentage of retrieved documents that are relevant. The average measures used in this study do not employ a cut-off, but evaluate the retrieval performance over the entire document collection. A discussion of the generalization of the concepts of recall and precision to such an overall evaluation is found in reference 7.

The average measures used herein are: rank recall, log precision, normalized recall, normalized precision, and a curve reflecting average precision at each 10% of recall. The first four of these measures are defined in reference 7. The recall-precision curve used here differs from any used in previous studies. Both the Quasi-Cleverdon curve, used in the earlier study (section 3), and the new curve used here are average plots of precision at each 5% or 10% of recall. Each query is averaged into each point of the plot. To accomplish this averaging process, an interpolation procedure is needed, since, for example, a query with two relevant documents can only achieve uninterpolated recall levels of 50% and 100%. The Quasi-Cleverdon curve and the new curve are distinguished by the method of interpolation used.

Figs. 1 and 2 show two graphs for a hypothetical query having 4 relevant documents. The relevant documents are assumed to be retrieved with ranks of 4, 6, 12 and 20. Thus, at 25% recall, the precision is 25%; at 50% recall, the precision is 33%; and so on. However, these values correspond actually to the highest possible precision points, since they are calculated just after a relevant document is retrieved. In this example, after 3 documents are retrieved, the precision is 0%; after 5 documents, the precision is 20%; and so on. This range of precision for each recall level is indicated by the top and bottom points in Figs. 1 and 2 at 25%, 50%, 75%, and 100% recall.
The solid saw-tooth line connecting these points is not used for interpolation; it is rather intended to indicate the drop in precision between the actual recall levels for this query, as more nonrelevant documents are retrieved. The Quasi-Cleverdon averages use a straight-line interpolation between peak points of precision, as indicated by the dashed line in Fig. 1. It has been argued that this interpolation is artificially high, since it lies at all points above the saw-tooth curve, and thus does not reflect, in any way, the precision drop as more nonrelevant documents are retrieved.

The new averages of Fig. 2 use an interpolation that projects a horizontal line leftward from each peak point of precision, and stops when a higher point of precision is encountered. This new interpolation curve (the dashed line in Fig. 2) does not lie above the saw-tooth curve at all points. When the precision drops from one recall level actually achieved to the next, an immediate drop in precision after the first point to the level of the next point is indicated. For example, in Fig. 2, the precision value at 50% recall is 33%, but at 55% recall, the interpolated value used for the new averages is 25% precision. When the precision rises from one recall level to the next, however, the first precision point actually achieved is ignored for purposes of interpolation. The achieved precision of 25% at 25% recall in the example of Fig. 2 is ignored, and for all recall levels from 0 to 50%, an interpolated precision of 33% is used for the new averages. The proponents of the new interpolation argue that this method indicates in all cases a precision that the user could actually achieve, if he were to use hindsight by retrieving exactly the right number of documents. The new averages are now used both at Harvard and at Cornell in evaluating the SMART system.
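The new interpolation rule just described can be sketched as follows (a minimal illustration in modern Python; the function and variable names are assumptions, not part of the SMART system). For each requested recall level, the interpolated precision is the highest peak precision attained at that recall level or beyond:

```python
# Sketch of the "new" interpolation: interpolated precision at recall
# level r is the highest peak precision attained at any recall >= r.
def interpolated_precision(rel_ranks, n_rel, levels):
    """rel_ranks: the ranks (1-based) at which the relevant documents
    were retrieved; n_rel: total relevant documents for the query;
    levels: recall levels (fractions) at which to interpolate."""
    peaks = []  # (recall, precision) just after each relevant retrieval
    for k, rank in enumerate(sorted(rel_ranks), start=1):
        peaks.append((k / n_rel, k / rank))
    out = []
    for r in levels:
        later = [p for rec, p in peaks if rec >= r]
        out.append(max(later) if later else 0.0)
    return out
```

For the hypothetical query of Figs. 1 and 2 (relevant documents at ranks 4, 6, 12, 20), this yields 33% precision for all recall levels up to 50%, 25% at 55% recall, and 20% at 100% recall, matching the example above.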
C) Statistical Tests

Several statistical tests are reported here, using as input the rank recall, log precision, normalized recall, normalized precision, and 10 points from the new recall-precision curve. The statistical tests are intended to measure the "significance" of the average difference in values of these measures obtained for two iterations or two distinct search algorithms. The test results are expressed as the probability that the two sets of values obtained from two separate runs are actually drawn from samples which have the same characteristics. A small probability value thus indicates that the two curves are actually significantly different. If this probability for one measure is, for example, 5%, the difference in the two average values of that measure is said to be "significant at the 5% level".

Fig. 1. An Illustration of the Interpolation Method Used for the Quasi-Cleverdon Recall-Precision Averages.

Fig. 2. An Illustration of the Interpolation Method Used for the New Recall-Precision Averages.

Choice of a statistical method for calculating this probability is important. The present study uses two statistical tests, the familiar T-test and the Wilcoxon Signed-Rank Test (WSR) [8]. The T-test takes account of the magnitude of the differences, and assumes that the measures tested are normally distributed. The WSR test does not make this assumption. However, the WSR test takes account only of the ranks of the differences, ignoring their magnitudes. Because this test does not assume normality of the input, and because it ignores some information (the magnitudes of the differences), the WSR test is more conservative than the T-test. It is therefore less prone to the error of calling a result "significant" when it is not.
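A minimal sketch of the WSR test follows, using the large-sample normal approximation rather than the exact tables of reference 8 (an assumption; the study's exact procedure may differ). Zero differences are discarded and tied absolute differences share their average rank:

```python
import math

def wilcoxon_signed_rank(x, y):
    """Approximate two-tailed Wilcoxon signed-rank test for paired
    samples, via the large-sample normal approximation."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    # Sum of ranks of the positive differences, compared to its null
    # distribution: mean n(n+1)/4, variance n(n+1)(2n+1)/24.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

With 42 queries per run, the normal approximation is adequate; for very small samples the exact tables would be preferable.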
Because information retrieval provides discrete rather than continuous data, and because only 42 data points (42 queries) are provided, the more conservative WSR test is preferable in the present evaluation.

5. Experimental Results

Results of three major areas of investigation are presented:

a) A comparison of two strategies that use only R', the set of relevant documents retrieved, to modify the query vector.

b) An investigation of the retrieval effect of the number of documents used for relevance judgment feedback on each iteration.

c) An investigation of strategies using the set S', that is, the nonrelevant documents retrieved, to modify the query.

The statistical significance of the average results obtained is tested in each case.

A) Two Strategies Using Relevant Documents Only

In the earlier report, summarized in section 3 [5], several strategies using relevant documents only are compared. The differences in total performance found among these strategies were very slight. A feedback effect comparison of two "relevant only" strategies is made here. The strategies chosen are:

1) The straightforward strategy of setting α equal to 1, π equal to 1, and the other constants equal to 0, in formula (B) (section 2). The feedback formula in effect for this strategy is therefore:

    Q_{i+1} = Q_i + \sum_{i=1}^{n'_r} r'_i

This formula is not equivalent to any strategy used in the earlier study, because the feedback effect evaluation provides new documents for feedback on each iteration (section 4A), while the total performance evaluation does not. The nearest comparable strategy from the previous study is the so-called "Q strategy" (also called "due only").

2) A strategy that gives added weight to the user's original query: π = 1, α = 1, ω = 4. This strategy is intended to compensate for the large difference in magnitude between document vector weights and query vector weights (section 2).

The difference in feedback effect between these two methods is trivial.
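The two strategies above are instances of the general query-update formula (B) of section 2, which can be sketched as follows (modern Python notation; the plain list-vector representation and the function name are illustrative assumptions). The relevant and nonrelevant document lists are assumed already sorted by decreasing correlation with the previous query, as the system requires:

```python
# Sketch of query-update formula (B):
#   Q_{i+1} = pi*Q_i + omega*Q_0
#             + alpha * sum of first min(n_a, n'_r) relevant vectors
#             + gamma * sum of first min(n_b, n'_s) nonrelevant vectors
def update_query(q_prev, q0, rel, nonrel, alpha=1.0, gamma=0.0,
                 pi=1.0, omega=0.0, n_a=None, n_b=None):
    rel = rel[:n_a] if n_a is not None else rel
    nonrel = nonrel[:n_b] if n_b is not None else nonrel
    dim = len(q_prev)
    new_q = [pi * q_prev[j] + omega * q0[j] for j in range(dim)]
    for d in rel:
        for j in range(dim):
            new_q[j] += alpha * d[j]
    for d in nonrel:
        for j in range(dim):
            new_q[j] += gamma * d[j]
    return new_q
```

Strategy 1 corresponds to alpha = 1, pi = 1 with all other parameters 0; the second strategy sets omega = 4 as well. The "Dec Hi" strategy of the earlier study (section 3) corresponds to gamma = -1 with n_b = 1.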
For all average measures, the differences observed are less than one and a quarter percent. The recall level averages for the second strategy, called "Q_0+", are presented in Fig. 6, to be described later. This result is hardly surprising, since for several "relevant only" strategies the total performance results reported in the earlier study are nearly identical.

B) Varying the Amount of Feedback

In the earlier study (section 3), the improvement in total performance achieved by feeding back ten rather than five documents to the user is impressive. This difference, however, is primarily due to the ranking effect. The feedback effect results, shown in Fig. 3, are actually better at medium recall levels when only five documents are used for feedback. At high recall levels, the performance achieved by feeding back five documents twice is roughly equal to that obtained by feeding back ten documents once. The average improvement in feedback effect gained by feeding back more documents on each iteration does not seem worth the cost of asking the user to make more relevance judgments.

It must be noted, however, that the feedback effect evaluation gives an unfair advantage to runs using few documents for feedback. When five documents are used for feedback, ranks 1-5 are frozen on the first iteration and ranks 1-10 on the second. When ten documents are fed back, however, ranks 1-10 are frozen on the first iteration and ranks 1-20 on the second. The difference in results caused by increasing the number of documents fed back is therefore minimized by the evaluation process.

A variable feedback strategy is tested in the earlier study. The user is asked to search the retrieved list from the top until he finds one relevant document or has seen fifteen documents.
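The variable feedback rule just described can be sketched as follows (an illustrative fragment in modern Python; the names and the set-of-ids representation of the relevance judgments are assumptions):

```python
# Illustrative sketch of variable feedback: the user scans the ranked
# list from the top until one relevant document is found, or until a
# fixed limit (here 15) of documents has been seen.
def variable_feedback(ranked_ids, relevant_ids, limit=15):
    """Return the document ids shown to the user for feedback."""
    shown = []
    for doc_id in ranked_ids[:limit]:
        shown.append(doc_id)
        if doc_id in relevant_ids:
            break  # one relevant document found; stop scanning
    return shown
```

The documents returned by such a scan would then be fed to the query-update formula, with the number of judgments varying from query to query.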
The total performance reported indicates that the user who does not require high recall (50% recall or less) reaps considerable benefit from this strategy, but that high recall performance is very little better than that obtained by constant feedback with N equal to 5. The feedback effect evaluation supports the same conclusion (section 6).

Fig. 3. Feedback effect of varying the amount of feedback (recall-precision graph).

Fig. 5. Significance results comparing the three feedback algorithms.

Fig. 6. Comparison of First and Second Iterations to Initial Search, Differences and Significance Levels - Q_0+ Algorithm.

Fig. 7. Comparison of First and Second Iterations to Initial Search, Differences and Significance Levels - Rocchio Algorithm.

Fig. 8. Comparison of First and Second Iterations to Initial Search, Differences and Significance Levels - Dec Hi Algorithm.

C) Strategies Using Nonrelevant Documents

For the Q_0+ strategy, only at 70 and 80% recall are the precision differences between the first iteration and the initial search not significant at the 10% level*. On the second iteration, the performance difference is significant at the 5% level for all points except 70% recall.
For the Rocchio strategy, however, only one measure (precision at 50% recall) shows a significant difference between the first iteration and the initial search at the 10% level or less. Even on the second iteration, only 8 of the 14 differences are significant at the 10% level or less, two at the 5% level or less. The significance of the corresponding differences for the Dec Hi strategy is similar.

This difference between strategies in the significance of the improvement obtained by feedback leads to a general conclusion: performance on all measures is less consistent for the nonrelevant document strategies than for the Q_0+ strategy. However, since the average magnitude of this improvement is equal for the three algorithms (from the significance results presented in Fig. 5), it must be true that the Rocchio and Dec Hi strategies are better for some queries and worse for others than is the more consistent Q_0+ strategy. The greater variance of the nonrelevant document strategies is therefore demonstrated not only by the normalized recall, but by all performance measures.

The results given here seem to indicate that for some types of queries, nonrelevant documents should be used for feedback, but for others, only relevant documents should be used. If the queries appropriate to each strategy could be distinguished easily before the retrieval operation, performance of the system could be improved by choosing the appropriate strategy for each query. Procedures for distinguishing such subgroups of queries must be investigated.

* For these comparisons, a one-tailed significance level is appropriate, since performance is expected to improve. To obtain one-tailed values, the reported two-tailed values must be divided by two. That is, the probability that the first iteration is no better than the initial search is 5% or less except at 70 and 80% recall.

6.
Summary and Recommendations

The isolation of the feedback effect adds to an understanding of relevance feedback in an automatic interactive retrieval system. The present investigation supports the earlier finding [5] that changing the constant parameters in the simplest query-update algorithm (using only relevant documents) has little effect on retrieval. This study contradicts the earlier conclusion concerning the optimum amount of feedback. Looking only at the feedback effect, returning ten rather than five documents no longer seems worth the extra user effort. However, feedback effect evaluation tends to minimize differences caused by varying the number of documents used for feedback. The combination of the results for total performance and feedback effect favors the general use of "variable feedback", in which the user searches the retrieval list only until one relevant document is found.

In feedback effect, strategies using nonrelevant documents no longer display an average performance superior to strategies using relevant documents only. Significance tests indicate that "relevant only" strategies are superior for some queries and nonrelevant document strategies for others.

Several areas of investigation are recommended. At least one more iteration of variable feedback, and at least two iterations of the "combination strategy" recommended in the earlier total performance study [5], should be investigated using feedback effect evaluation. Significance tests on the variable feedback results should also be obtained. Rocchio's strategy with normalization should be investigated.

The results of comparing relevant only and nonrelevant document strategies strongly suggest that an investigation be made of subgroups of queries. These two types of algorithms seem appropriate to different groups of queries. It would be useful to be able to choose the appropriate strategy for a query before retrieval, by examination of the query. For
variable feedback, an investigation of query subgroups should be performed to determine whether or not some identifiable group of users is shortchanged by this strategy.

The use of query subgroups, however, raises questions of sample adequacy. Even if the 200 documents and 42 queries of the Cranfield 200 collection are representative of a typical retrieval environment, the statistical dangers of dividing 42 queries into small subgroups must be considered. Investigation of relevance feedback should therefore be continued in a more adequate environment. The Cranfield 1400 document collection, available to the SMART system, would provide significantly larger samples of documents and queries.

References

[1] J. J. Rocchio, Relevance Feedback in Information Retrieval, Scientific Report No. ISR-9 to the National Science Foundation, Section III, Harvard Computation Laboratory, August 1965.

[2] J. J. Rocchio, Document Retrieval System Optimization and Evaluation, Harvard University Doctoral Thesis, Scientific Report No. ISR-10 to the National Science Foundation, Harvard Computation Laboratory, March 1966.

[3] J. J. Rocchio, G. Salton, Search Optimization and Iterative Retrieval Techniques, Proceedings of the Fall Joint Computer Conference, Las Vegas, November 1965.

[4] W. Riddle, T. Horwitz, R. Dietz, Relevance Feedback in Information Retrieval Systems, Scientific Report No. ISR-11 to the National Science Foundation, Section VI, Department of Computer Science, Cornell University, June 1966.

[5] E. Ide, User Interaction with an Automated Information Retrieval System, Scientific Report No. ISR-12 to the National Science Foundation, Section VIII, Department of Computer Science, Cornell University, June 1967.

[6] H. Hall, N. Weiderman, The Evaluation Problem in Relevance Feedback Systems, Scientific Report No. ISR-12 to the National Science Foundation, Section XII, Department of Computer Science, Cornell University, June 1967.

[7] G.
Salton, The Evaluation of Automatic Retrieval Processes, Selected Test Results Using the SMART System, American Documentation, Vol. 16, No. 3, July 1965.

[8] F. Wilcoxon, Some Rapid Approximate Statistical Procedures, American Cyanamid Company, Stamford, Connecticut, 1949.