VIII-1

VIII.

New Experiments in Relevance Feedback E. Ide

Abstract New results are given for interactive user controlled retrieval strategies, using relevance feedback. Search strategies are included, based

on the identification by the user of a fixed or variable number of relevant or nonrelevant documents. The evaluation results are based on experiments

using 200 documents and 42 queries in the field of aerodynamics.

1.

The Relevance Feedback Procedure Automated information retrieval systems, like most mechanical

processes, suffer from unavoidable inflexibility.

The needs of users of

a large information collection, especially a document collection, are too varied to be satisfied with any one full automatic retrieval algorithm. Users whose needs best match the assumptions built into the system are satisfied; others are not. One suggested way to overcome this limitation is to employ feedback information from the user during the retrieval process. In a document

retrieval situation, this could be accomplished as follows:

a) b)

The user poses a request to the retrieval system. The retrieval system returns some information (perhaps abstracts) about a specified number of documents judged relevant to the user's request.

VIII-2

c)

The user selects from this set of initially retrieved items those documents which he_ deems relevant to the request, and feeds this information to the retrieval system.

d)

Another retrieval search is performed incorporating these user judgments.

Steps c) and d) are iterated as often as desired. Such an interactive process was proposed by Rocchio, who called it "relevance feedback" . [1,2,3]. He showed that in a document retrieval system

based on classification and using the cosine correlation function, the theoretically optimum query for retrieving a set of documents given by the formula: n r
ODt

R = {r.} is

n s s.

/L
1

Tii

%z.
1

1

l s i|

where:

R = {r.} =
I

the descriptor vectors of all documents in the collection

which are relevant, according to the user, to the request; S = {s.} = the descriptor vectors of all other documents in the

collection, i.e. of all nonrelevant documents; n n r s r.
1

= =

number of relevant documents in the collection; number of nonrelevant documents; , s.i =
'
I

length of the document descriptor vectors r., s. .
c

I

'

'

1

1

Of course, the sets

R

and

S

are not known to the system.

On each

iteration, however, the user feedback supplies information about two subsets, R1 e R and S 1 e S , where R' = -{r1.}
I

is the subset of relevant documents

VIII-3

retrieved and

Sf = {s!}
1

is the subset of nonrelevant documents retrieved.

Therefore, the following formula is used by Rocchio to construct a new query from the query of the previous iteration: n' r n' s

Q+ = 8i i1
where n' r (n') s

* 4 L ?rr - i E A

(A)

is the number of relevant (nonrelevant) documents retrieved

for feedback in the previous iteration.

Rocchio investigated relevance feedback using the above formula and 17 queries in the field of computer science and found that his algorithm does improve retrieval results [1,2,3]. Another investigation of a relevance feedback system was based on the "ADI collection", a collection of 82 documents presented at a conference on documentation. Thirty-five queries were constructed for this collection,

and the documents considered relevant to those requests were specified by the two originators of the queries. The investigation of relevance feed-

back in the ADI collection was conducted by Riddle, Horwitz, and Dietz [4]. They used 22 of the 35 queries and studied a slightly different algorithm for modifying the search query using the following method of query modification:
n

r Q.+l = Q. + a \ r!

VIII-4

2.

The Experimental Environment The document collection used in this study (the "Cranfield"

collection) contains 200 documents from the field of aerodynamics, chosen from a library of 1400 documents. For this collection, there exist at

present 42 queries, constructed by some of the authors of the 1400 documents; these requestors are also responsible for the relevance judgments. The concept vectors describing document and queries are quite sparse for this collection. The maximum number of concepts used to describe The largest weight The query

one document is 85, out of a possible 552 concepts.

given to any concept in any document descriptor is 288.

description vectors are sparser by one order of magnitude and shorter than the document descriptors. The maximum number of concepts used in

a single query vector is 13; the largest weight in any query vector is 24. The largest number of documents relevant to a single query is 12, The comparative brevity of the query

or six percent of the collection.

vectors in this collection is typical for technical document retrieval, and provides a strong argument for the use of relevance feedback techniques. With relevance feedback, the user, in effect, provides a much more detailed query merely by citing a document; the document descriptor itself is used as the query. The relevance feedback system being studied uses the following query-update formula: min(n a , n^.) Q.+l - 7 Q T + o) Q + a \ min (nb, n^) r. + y \ s! (B)

LJ
1

LS
1

VIII-5

where

n 1 + nf

(see equation A, section 1) equals

N , the number of

documents retrieved for feedback.

The experimental variables are a, OJ, ir, y, n , n , and N. a ID

The parameter

a is positive, and weights all incoming relevant documents relative to the other contributors to the query (previous query, initial query, nonrelevant documents) . The parameter i permits the previous query to be r Q is the

increased in weight relative to the incoming documents.

initial query, as opposed to the query of the previous iteration; O J permits the initial query to be used as part of the new query (see section 3B). The parameter y should theoretically be negative, as it permits

some significance to be attached to the nonrelevant documents retrieved. The parameter n a (n ) permits some specific number of relevant (nonJ D

relevant) documents to be used in the query even if larger. It is assumed that the r!
1

and
1

s!

n' (n1) is r s are indexed in order of

decreasing relevance (as determined by the system) to the query; that is, the n a relevant documents (or n b nonrelevant documents) used in the

new query will be those closest in the descriptor space to the previous query. The flexibility of this formula permits the investigation of

several feedback strategies. The system also provides the following formula to simulate Rocchio's algorithm: min (n1/ n ) r a ) / 1 min (n1 , n, ) s b ) / 1 s. (C)

Q. . = , *i+l

T T

n1 n' Q. + u Q + n' > r s * i * o s

E **-•* E
r, - n' i r

Formula (C) does not, however, normalize the vector lengths as does Rocchiof s algorithm.

VIII-6

The document and query description vectors for both collections were constructed using a SMART thesaurus dictionary on the document abstracts and the queries [3]. The cosine correlation function is used

to determine the order of retrieval.

3.

Earlier Results in the Same Environment An earlier study[5] uses the Cranfield 200 document collection

and the relevance feedback system described in section 2. variations in relevance feedback strategy are investigated:

Three major

1)

The parameters IT, OJ, and a in formula B of section 2 are varied, holding \i equal to 0. This strategy is similar to

the type investigated by Riddle, Horwitz, and Dietz [4]. The variation in results obtained when IT, OJ, and a are changed is slight, in fact, less than the variation found by Riddle, Horwitz, and Dietz, who used a different document collection. 2) The number of documents retrieved for feedback N is set to 5, 10, and 15 documents. (N) is varied.

The improvement ob-

tained by feeding back 10 documents instead of 5 is impressive; the further improvement obtained with 15 documents is less so. In addition, a "variable feedback" strategy is investigated, in which documents are retrieved until one relevant document is found, or until 15 documents have been retrieved. This

strategy provides greatly improved precision at low recall but only slightly better precision at high recall than does the N = 5 strategy.

VIII-7

Two strategies using the information in retrieved nonrelevant documents are studied. In formula B n, b is set

(section 2) , \i is set to minus 1 and to 1 and 2.

That is, the first nonrelevant document

(the first two nonrelevant in the second strategy) is subtracted from the query. n The strategy in which

= 1 (called "Dec Hi"), with N equal to 5, produces

a performance comparable to that of a strategy using only relevant documents with M equal to 10. The following conclusions may be reached from the results of the earlier study: The investigation

supports relevance feedback as an information retrieval strategy. It also shows that varying the

parameters in a query-update formula which uses relevant documents only is not a promising way to produce significant improvement in performance. The most

promising strategies investigated, variable feedback and nonrelevant document feedback, require further study before they can be firmly recommended. The variable

feedback strategy and the suggested combination of fixed and variable feedback should be investigated in a suitable evaluation system. The nonrelevant feedback strategies

should be studied in a system which permits normalizing as in Rocchio [1]. Eventually, some combination of fixed and variable feedback may prove optimal in similar information retrieval environments.

4.

Evaluation of Retrieval Performance

A)

The "Feedback Effect" in Evaluation

The investigation described in section 3 uses a retrieval and evaluation method that has been assessed by Hall and Weiderman [6]. After

VIII-8

each iteration, all documents in the collection are ranked and the topranked N documents are used for feedback. Hall and Weiderman point out

that evaluation of this retrieval technique takes into account two effects, which they call "ranking effect" and "feedback effect". Relevance feedback in effect uses information from one or more document descriptors to modify the query descriptor. The relevant documents

used for this purpose will be ranked higher by the modified query than previously, and the nonrelevant documents used will be ranked lower. effect of these rank changes in "retrieved" documents is termed the "ranking effect". If the ranking effect is included in an overall The

performance measure, the measured change in performance between feedback iterations is quite impressive, as is seen in the results included in the earlier report described in section 3 [5]. This large change in "total performance" (including both ranking and feedback effect) indicates the extent to which the initial query has been perturbed toward the centroid of the relevant documents, and strongly supports Rocchiofs theory. Hall and Weiderman state that in an environment where the user must actively supply relevance judgments for feedback, changes in the ranks of documents which the user has already seen are of no interest to him. The user in such an environment is concerned primarily with the

"feedback effect"; that is, the effectiveness of the modified query in bringing new relevant documents to his attention. They conclude that,

though total performance is a valid measure of the effectiveness of relevance feedback in approaching the "ideal query", the feedback effect should be isolated and examined as well.

VIII-9

The present study evaluates feedback performance in the manner suggested by Hall and Weiderman by discarding the ranking effect and preserving only the feedback effect. The ranks of the top N documents

retrieved in each iteration (the documents used for feedback) are "frozen" in all subsequent iterations, and only the remainder of the collection is searched using the modified query. Thus, in the present

investigation, the N documents retrieved on any iteration are guaranteed to be N new documents; that is, documents not used for feedback on any previous iteration. Moreover, the performance measures for the first

(second, third) iteration are calculated from a ranked document list in which the top N (2N, 3N) documents are the same as those retrieved previously. Only the changes in the ranks of documents not yet seen

by the user is measured. The evaluation described gives overall results that are deceptively low. Because the top ranks are frozen, no newly retrieved

document can achieve a rank higher than that of any previously retrieved document. With a constant feedback strategy, therefore, on the first

(second, third) iteration, the highest possible rank for a new document is N+l (2N+1, 3N+1). For this reason, the feedback effect evaluation

is a misleading measure of the performance of the retrieval system, and should be used in conjunction with other evaluation methods. Iso-

lation of the feedback effect is primarily useful to compare different feedback strategies from the viewpoint of a user in an interactive retrieval environment.

VIII-10

B)

Performance Measures

Several average measures of the performance of the tested retrieval algorithms on the 42 Cranfield queries are used in this report. Each measure is based on the concepts of "recall" and "precision". In evaluating an in-

formation retrieval system, an arbitrary cut-off is often employed, and documents above this cut-off are termed "retrieval". With such a cut-off, "recall" is the percentage of documents relevant to the user that are retrieved, and "precision" is the percentage of retrieved documents that are relevant. The average measures used in this study do not employ a cut-off,

but evaluate the retrieval performance over the entire document collection. A discussion of the generalization of the concepts of recall and precision to such an overall evaluation is found in reference 7. The average measures used herein are: Rank recall, log precision, normalized recall, normalized precision, and a curve reflecting average precision at each 10% of recall. The first 4 of these measures are defined in reference 7. The recall-precision curve used here differs from

any used in previous studies. Both the Quasi-Cleverdon curve, used in the earlier study (section 3), and the new curve used here are average plots of precision at each 5% or 10% of recall. Each query is averaged into each point of the plot. To accomplish this averaging process, an interpolation procedure is needed, since, for example, a query with two relevant documents can only achieve uninterpolated recall levels of 50% and 100%. The Quasi-Cleverdon curve and new curve are distinguished by the method of interpolation used.

VIII-11

Figs. 1 and 2 show two graphs for a hypothetical query having 4 relevant documents. The relevant documents are assumed to be retrieved Thus, at 25% recall, the precision is However, these

with ranks of 4, 6, 12 and 20.

25%, at 50% recall, the precision is 33%, and so on.

values correspond actually to the highest possible precision points, since they are calculated just after a relevant document is retrieved. In this example, after 3 documents are retrieved, the precision is 0%, after 5 documents, the precision is 20%, and so on. This range of

precision for each recall level is indicated by the top and bottom points in Figs. 1 and 2 at 25%, 50%, 75%, and 100% recall. The

solid saw-tooth line connecting these points is not used for interpolation; it is rather intended to indicate the drop in precision between the actual recall levels for this query, as more nonrelevant documents are retrieved. The Quasi-Cleverdon averages use a straight-line interpolation between peak points of precision, as indicated by the dashed line in Fig. 1. It has been argued that this interpolation is artificially high, since it lies at all points above the sawtooth curve, and thus, does not reflect, in any way, the precision drop as more nonrelevant documents are retrieved. The new averages of Fig. 2, use an interpolation that projects

a horizontal line leftward from each peak point of precision, and stops when a higher point of precision is encountered. This new interpolation

curve (the dashed line in Fig. 2) does not lie above the sawtooth curve at all points. When the precision drops from one recall level actually

VIII-12

achieved to the next, an immediate drop in precision after the first point to the level of the next point is indicated. For example, in Fig. 2, the

precision value at 50% recall is 33%, but at 55% recall, the interpolated value used for the new averages is 25% precision. When the precision

rises from one recall level to the next, however, the first precision point actually achieved is ignored for purposes of interpolation. The

achieved precision of 25% at 25% recall in the example of Fig. 2 is ignored, and for all recall levels from 0 to 50%, an interpolated precision of 33% is used for the new averages. The proponents of the new interpolation argue

that this method indicates in all uses a precision that the user could actually achieve, if he were to use hindsight by retrieving exactly the right number of documents. The new averages are now used both at Harvard

and at Cornell in evaluating the SMART system.

C)

Statistical Tests

Several statistical tests are reported here using as input the rank recall, log precision, normalized recall, normalized precision, and 10 points from the new recall-precision curve. The statistical tests are

intended to measure the "significance" of the average difference in values of these measures obtained for two iterations or two distinct search algorithms. The test results are expressed as the probability that the two sets

of values obtained from two separate runs are actually drawn from samples which have the same characteristics. A small probability value thus inIf this

dicates that the two curves are actually significantly different.

probability for one measure is, for example, 5%, the difference in the two average values of that measure is said to be "significant at the 5% level".

\

original curve interpolated curve

40

50 60 Recall

80

90

100

An Illustration of the Interpolation Method Used for the Quasi-Cleverdon Recall-Precision Averages Fig. 1

original curve interpolated curve

10

20

30

40

50 60 Recall

70

80

90

100

An Illustration of the Interpolation Method Used for the New Recall-Precision Averages Fig. 2

VIII-14

Choice of a statistical method for calculating this probability is important. The present study uses two statistical tests, the familiar

T-test and the Wilcoxon Signed-Rank Test (WSR) [8]. The T-test takes account of the magnitude of the differences, and assumes that the measures tested are normally distributed. The WSR test does not make this assumption.

However, the WSR test takes account only of the ranks of the differences, ignoring their magnitude. Because this test does not assume normality

of the input and because it ignores some information (magnitudes of differences), the WSR test is more conservative than the T-test. It is

therefore less prone to the error of calling a result "significant" when it is not. Because information retrieval provides discrete rather than

continuous data, and because only 42 data points (42 queries) are provided, the more conservative WSR test is preferable in the present evaluation.

5.

Experimental Results Results of three major areas of investigation are presented:

a)

A comparison of two strategies that use only R' , the set of relevant documents retrieved, to modify the query vector.

b)

An investigation of the retrieval effect of the number of documents used for relevance judgment feedback on each iteration.

c)

An investigation of strategies using the set S 1 , that is, the nonrelevant documents retrieved, to modify the query. The

statistical significance of the average results obtained is tested in each case.

VIII-15

A)

Two Strategies Using Relevant Documents Only

In the earlier report, summarized in section 3 [5], several strategies using relevant documents only are compared. The differences in A feed-

total performance found among these strategies were very slight.

back effect comparison of two "relevant only" strategies is made here. The strategies chosen are:

1)

The straightforward strategy of setting a equal to 1, T T equal to 1, and the other constants equal to 0, in formula B (section 2 ) . The feedback formula in effect for this strategy is therefore:

*i+l

*i

This formula is not equivalent to any strategy used in the earlier study because the feedback effect evaluation provides new documents for feedback on each iteration (section 4 A ) , while the total performance evaluation does not. The nearest

comparable strategy from the previous study is the so-called "Q 2) strategy" (also called "due only"),

A strategy that gives added weight to the user's original query: T = 1, a = 1, c = 4. T o This strategy is intended to

compensate for the large difference in magnitude between document vector weights and query vector weights (section 2 ) .

The difference in feedback effect between these two methods is trivial. For all average measures, the differences observed are less than The recall level averages for the second This

one-and-a-quarter percent,

strategy, called "Q +", are presented in Fig. 6 to be described later.

results is hardly surprising, since for several "relevant only" strategies, the total performance results reported in the earlier study are nearly identical.

VIII-16

B)

Varying the Amount of Feedback

In the earlier study (section 3 ) , the improvement in total performance achieved by feeding back ten rather than five documents to the user is impressive. effect. This difference, however, is primarily due to the ranking

The feedback effect results, shown in Fig. 3, are actually better

at medium recall levels when only five documents are used for feedback. At high recall levels, the performance ahieved by feeding back five documents twice is roughly equal to that obtained by feeding back ten documents once. The average improvement in feedback effect gained by feeding back

more documents on each iteration does not seem worth the cost of asking the user to make more relevance judgments. It must be noted, however, that the feedback effect evaluation gives an unfair advantage to runs using few documents for feedback. When

five documents are used for feedback, ranks 1-5 are frozen on the first iteration and ranks 1-10 on the second. When ten documents are fed back,

however, ranks 1-10 are frozen on the first iteration and ranks 1-20 on the second. The difference in results caused by increasing the number

of documents fed back is therefore minimized by the evaluation process. A variable feedback strategy is tested in the earlier study. The

user is asked to search the retrieved list from the top until he finds one relevant document or has seen fifteen documents. The total performance

reported indicates that the user who does not require high recall (50% recall or less) reaps considerable benefit from this strategy, but that high recall performance is very little better than that obtained by a constant feedback with N equal to 5. The feedback effect evaluation produces

VIII-17

**<s

initial search N= 5 N = IO first iteration second iteration

10 20 30 4 0 50 60 70 80 90 100 Recall

Increment Only Feedback of Five or Ten Documents Fig. 3

VIII-18

64f

• initial search o constant feedback (N=5) variable feedback

10 20 30 40 50 60 70 80 90 100 Recall

Constant Feedback Compared to Variable Feedback - First Iteration

Fig. 4

VIII-19

a different picture.

In Fig. 4, one iteration of variable feedback is The "average user"

compared to one iteration feeding back 5 documents.

must look at 4 documents from the initial search to find one relevant document. The feedback effect performance at 10% recall is better for At 20% recall, the two strategies are At medium recall levels, At recall levels

the variable feedback strategy.

the same, at 30% variable, feedback is worse.

the two strategies are again approximately the same.

above 70%, however, the variable feedback strategy gives higher precision. The variable feedback results are also affected by the feedback effect measure. Since for 75% of the queries five or fewer documents are

used for feedback, variable feedback receives the "unfair advantage" noted earlier. This advantage should, however, be most evident at low

recall, whereas the major improvement observed here is at high recall. The apparent inconsistency of a large total performance improvement (from constant feedback to variable feedback) at low recall, with a feedback effect improvement at high recall is easily explained. As was mentioned in section 4A, the feedback effect evaluation prevents a document retrieved by feedback from achieving a higher rank than any document retrieved earlier. The total performance improvement at low

recall indicates that relevant documents retrieved early (including the first relevant document retrieved, which is used for feedback) make larger jumps in rank with variable feedback than with constant feedback. These jumps are inhibited by the artificial rank ceiling imposed by the feedback method of evaluation. The feedback effect shows improvement

at high recall, indicating that even at high recall levels, new relevant

VIII-20

documents appear sooner in the retrieved list with variable feedback than with constant feedback. The total performance results favor variable The

feedback primarily for the user who does not require high recall. feedback effect results support this type of variable feedback as a strategy to be generally recommended in an interactive environment. Two further investigations must, however, be considered.

First,

the performance of several iterations of the variable feedback strategy should be investigated using the feedback effect evaluation. Second, the An

results presented here are valid for a hypothetical "average" user.

examination of subgroups of queries would show whether or not certain types of users achieve better results with the constant feedback strategy.

C)

Strategies Using Nonrelevant Documents

In the earlier report (section 3 ) , a strategy using nonrelevant documents displays a total performance similar to that achieved using relevant documents only and using twice as many documents for feedback. This nonrelevant document strategy, called "Dec Hi" in both the earlier and the present reports, uses the retrieval formula: n' r
Q i+i

=

Q

I

+

2_, r i - s i '

where

s

is the first nonrelevant document retrieved.

In the previous study, it was recommended that Rocchio's relevance feedback strategy, a strategy using all documents retrieved, be tested.

VIII-21

The present study tests Rocchio's strategy, but without normalizing the lengths of the documents used to modify the query. Since the cosine

correlation is used to rank documents for retrieval, this normalization is a theoretical necessity [2] . The results of the significance tests on the three strategies, Q + (see section 5 ) , Rocchio (Rocchio's algorithm without normalization), and Dec Hi are given in Fig. 5. Fig. 5 shows the significance levels of

the differences among the three strategies for two iterations, using the less conservative T-test. It is evident that the differences in averages

among the three strategies described are not significant. Two comparisons are of particular importance: The differences in

normalized recall between the "relevant only" strategy, Q +, and the two nonrelevant document strategies, Rocchio and Dec Hi, on the first iteration. The five percent difference between Q + and Dec Hi is significant at the 6% level, and the six percent difference between Q + and Rocchio is significant at the 3% level, according to the T-test. However, the

Wilcoxon Signed-Rank Test (WSR) indicates that the two algorithms do not give significantly different results. The significance level comparing

Q + and Dec Hi is 46%, and that comparing Q + and Rocchio is 48% o o These different significance levels must be considered in the light of the characteristics of the two significance tests. takes account of magnitude, the WSR test considers only rank. The T-test Evidently,

differences favoring Q + and differences favoring the nonrelevant document strategy are mixed in rank, producing "insignificant" results for the WSR test. Yet, some of the results favoring Q + (not all, because the

ranks are mixed) must be very large in magnitude, to give "significant"

V I I I - 22

CN
CT>

oo

^

CO

^ O O O O O C O O O ^ < X > C N O O v O O O L O O H H ^ r ^ v D H > > 4

cch

1- Ho
0

•H 4- 1 fd *H
CD H

o

LO LO

^
<T>

^>

ro

00 LD

cr>o >oo^rr^Ho >'sj rHoo

LO

O

^D O CN r H

O O H i n C N J O O O L O r H O O

|

C 1 N

1

1

1 H 1 1

e n 3 e n

>
•H

CD

H

00 rH ^0

<7i

00

o rHcr>00(Nr^r-oooo^m
LD LO

CN m

K
O

Q

c u

fd

C D

-P
H

1
j

^

^

v£> CO

(N^rmOOOOoOOOHCNH
rH 1 1 1 1

•rH

*
H

ro

T^

^ CTi

VD

0 -H

r H oo CO 0 0

o^ooHr^vocsioo^oo e^ C P i H O O ^ D O C M H O O O CM o o o o c o L n ^ o v D L o r ^ m v D

O OJ

^ +
JM

w - ^ c:
i •H ?H t-4

in

rH 0)

d Iterations
fi o u
(D CO Xf) rH
CD

•2 + >

ro r^

v.ooocr>c7»^rcTiCNOOr^cF> ^o^^srcNeNooc^cTir-oo

'
•rH

+

0

03 U
CD

-H

H3

^r
i

CO

O

O

1 s\
0)

+
*H Oi

+
•rH

+
•rH

Fir
-P fd

e n 2

-P H

I

r-

ro 1 1

cNoooocnoooor^mrHoo I i l

w l

w

u

> C O a ^\ fd
C D

- en
£

-P

0

e c n fd u
•rH MM -rH £ LO

>
•H rH

a
- en £

CDi

oo ro r»
!
1

*

e n
C D

H1 rH 00

CNi^rooHinLn'^r-cT>Ln
^ h H C O O O i n H C N h C N 00LnvDLO^rmvx)^j,«x>t^ II

C D Cn • H *H •rH

I Q ^

0
-rH

oo i n

-p fd
5M

r^

C\

- H fl
1 1
rH -rH rH

'd
fd

fa

II
rH

CN 1

Oi
rH

a^
00

LO V£) O

-P

••

*
CN rH

o
-P

rH

^

in

tf

rH

O^f

H

CN

&
C N O O C N ^ v O C N O C T i O O 1 rH rH

+
01

i

o

e n 3 e n 5M
CD i

H C D -P H

rH

CN ^r o

u o U--I
H
rH fd 0)

•H • r l •H -H -P -P -P -P
fd C D fd C D fd fd

C J tf 0 a 0 0

d
0

ex

(D Q

I ^r
I

CM 1

H M

1

u CD u CD

-P +J •P -P H H H H

o

>
-rH O O

o i
;

00

r^

<sT 00

*

«
^r
rH ^ ^ h i O r o ^ O L O
,

C

O

^

*-*
•rH

-d
1

tf

o

£ o
?M CD

C> rH T CT> <0

Ln C T > 0 O ^ 0 0 C N L O 0 0 C N L O < £ >
G \ C 0 < J \ ( J \ ( T \ ^ K O ^ i r ) \ D

C D N

o ex
C O

+
3

•H
rH fd

cn
c n
C O

+0
^

-p

H O

^

rH

cr>

O o O H H r H L O r ^ r H r ^ v D 1 1 rH

£

O 1

\D CM 1 1

n
0

u C D

cn

53
4-1

>
•H

M

C D

L

H

'
o\° O rH rH fd U C D o\° o\° o\° o\° o\° o\° o\° o\° oV> O O O O O O O O O

O

>
-H

O (D
O fd

u u

M

o

(D

• • u CU
u Hi
•

H C N o o ^ i n * x > r ^ o o c r > ro H
rH C D C D r-1

s

2

CD

•

ti
C D

tf

•
c u
0

>

U •H

fr

^ C

2

tP U 0 0

6 g • S u H
< D

•H C Cn
CO

M M

Comparing Thr

^ CO > CT>

CD rH r-

O 00

C T > L O H C T > 0 0 [ ^ ^ > 0 0 H 0 O

-P

cn a)

^r•
CM

• 0 0• LO • •• +0

^r G\ ^r

•H

r^oocT>oocnoo^oincN cooocoi^^)^LnLncno>

3 •A o o S
C)

••

edbac feren
0)

C D -p H

-p

u

no CD ^r H
in

CNJ^OiDt^CT.OO^HvDLn H

I

-r

+ •H <H Oi Oi

+

n c CD n e

^

o

fa a) a)

MM •rH

Q

VIII-23

indications on the T-test.

Thus, for a few queries, the Rocchio and

Dec Hi algorithms must be much less effective than Q + as measured by normalized recall. The total performance normalized recall measured in the previous study (section 3) was also low for Dec Hi, compared with a "relevant only" strategy. The explanation given in the earlier report is equally valid In brief, the use of

for the feedback effect results reported here.

nonrelevant documents for feedback seems to raise the ranks of fairly high-ranking relevant documents, and at the same time, lower the ranks of some low-ranking relevant documents. The significance levels obtained by comparing the first and second iteration results to the initial search result, within each of the three strategies, are very informative. Figs. 6, 7, and 8, show the performance The-significance of

of algorithms Q +, Rocchio, and Dec Hi respectively.

the gap between the initial search and each iteration is tested, using the more conservative WSR test. Looking at the three recall-precision graphs, the average performance of the three algorithms seems quite similar. Fig. 5 shows that, Yet,

in fact, the differences in average performance are not significant. the significance levels displayed in Fig. 6 differ greatly from those displayed in Figs. 7 and 8.

For the Q + strategy, the differences between the initial search and each feedback iteration are significant. On the first iteration, the

four overall measures and the precision differences from 20% through 50% recall are significant at the 5% level or less, and only at 70 and 80%

VIII-24

£zz£

Q0+
Qo + :

Algorithm

Qj+l = Qj +4Q 0 + Srj o initial search —>^—first iteration --A—second iteration

10 20 30 4 0 50 60 70 80 90 100 Recall

First Iter. vs. Initial Search Percent Percent Difference Significance Rank Recall 4.1 3.8 Log Precision 4.1 2.3 Normalized Recall 3.1 0.6 Normalized Precision 2.7 1.8 6.8 Recall Level 10% 1.0 4.3 2 0 % 1.8 1.5 3 0 % 2.2 0.5 4 0 % 3.0 0.8 5 0 % 3.1 5.7 6 0 % 3.5 7 0 % 2.4 29.2 8 0 % 3.2 18.3 6.1 9 0 % 3.6 6.1 100% 3.6

Second Iter. vs. .Initial Seared Percent Percent Significance Difference 1.9 5.1 1.2 2.8 0.1 3.7 0.5 3.6 1.3 1.8 0.8 2.5 0.9 3.0 0.4 3.7 I.I 3.7 2.8 5.7 19.5 4.4 4.5 4.9 2.4 5.6 2.6 5.2

Comparison of First and Second Iterations to Initial Search Differences and Significance Levels - Q + Algorithm o Fig. 6

VIII-25

60

kz^£^. A

AS

\

\

Rocchio Algorithm o initial search — A — first iteration 'S>A — A — second iteration

50
o
CO

£40

Rocchio: Qi+I =n r n sQi \ n'sSn-nrSj
A^

+

30

10 20 30 4 0 50 60 70 80 90 100 Recall

First Iter. vs. Initial Search Second Iter. vs. Initial Search Percent Percent Percent Percent Difference Significance Difference Significance 5.4 5.3 Rank Recall 19.1 4.0 2.6 8.2 Log Precision 1.8 32.6 23.5 -0.8 Normalized Recall -3.0 60.0 1.6 20.8 Normalized Precision -0.1 55.5 30.7 1.6 Recall Level 10% 1.0 51.8 12.4 2.8 25.8 20% 2.1 5.3 3.2 14.9 2.3 30% 4.1 2.2 10.4 2.9 40% 3.0 4.3 8.8 3.2 50% 28.7 4.5 47.7 3.1 60% 5.4 12.0 43.2 3.1 70% 8.5 5.8 22.1 4.3 80% 5.7 5.1 4.4 19.1 90% 5.3 7.4 21.1 100% 4.2
Comparison of First and Second Iterations to Initial Search Differences and Significance Levels - Rocchio Algorithm Fig. 7

I

60?

Dec Hi Algorithm Dec Hi: Qi+i
= +

Qj

2n-si

50h
o
o 0)40
Q. —

initial search first iteration --second iteration

30h x 10
± ±
J L J_ J

•V.
L

20 30 4 0 50 60 70 80 90 100 Recall

X

-o

First Iter. vs. Initial Search Second Iter.vs.Initial Search Percent Percent Percent Percent Difference Significance Difference Significance 9.5 13.6 4.8 Rank Recall 4.3 2.6 9.2 Log Precision 14.5 2.3 19.6 -3.3 Normalized Recall -2.3 45.1 14.1 0.6 25.9 Normalized Precisioni 1.5 21.5 1.6 30.9 Recall Level 10% 1.2 8.4 2.4 2.9 13.3 20% 7.1 3.3 8.3 2.8 30% 4.7 3.7 5.5 1.8 40% 4.1 4.5 2.3 50% 4.0 4.4 6.5 5.2 14.5 60% 16.7 2.8 39.1 5.1 70% 8.5 16.7 5.4 4.3 80% 10.5 4.2 4.6 17.3 90% 10.9 4.1 4.5 17.9 100%
Comparison of First and Second Iterations to Initial Search Differences and Significance Levels - Dec Hi Algorithm Fig. 8

VIII-27

recall are the precision differences not significant at the 10% level*. On the second iteration, the performance difference is significant at the 5% level for all points except 70% recall. For the Rocchio strategy,

however, only one measure (precision at 50% recall) shows a significant difference between the first iteration and initial search at the 10% level or less. Even on the second iteration, only 8 of the 14 differences The

are significant at the 10% level or less, two at the 5% level or less.

significance of the corresponding differences for the Dec Hi strategy are similar. This difference between strategies in the significance of the improvement obtained by feedback leads to a general conclusion: Per-

formance on all measures is less consistent for the nonrelevant document strategies than for the Q + strategy. However, since the average

magnitude of this improvement is equal for the three algorithms (from the significance results presented in Fig. 5 ) , it must be true that the Rocchio and Dec Hi strategies are better for some queries and worse for others than is the more consistent Q + strategy. The greater variance of nonrelevant document strategies is therefore demonstrated not only by the normalized recall, but by all performance measures. The results given here seem to indicate that for some

types of queries, nonrelevant documents should be used for feedback, but

For these comparisons, a one-tailed significance level is appropriate, since performance is expected to improve. To obtain one-tailed values, the reported two-tailed values must be divided by two. That is, the probability that the first iteration is no better than the initial search is 5% or less except at 70 and 80% recall.

VIII-28

for others, only relevant documents should be used.

If the queries approp-

riate to each strategy could be distinguished easily before the retrieval operation, performance of the system could be improved by choosing the appropriate strategy for each query. Procedures for distinguishing such

subgroups of queries must be investigated.

6.

Summary and Recommendations The isolation of the feedback effect adds to an understanding of

relevance feedback in an automatic interactive retrieval system.

The

present investigation supports the earlier finding [5] that changing the constant formula parameters in the simplest algorithm — using only relevant documents — has little effect on retrieval. This study contra-

dicts the earlier conclusion concerning the optimum amount of feedback. Looking only at the feedback effect, returning ten rather than five documents no longer seems worth the extra user effort. However, feedback

effect evaluation tends to minimize differences caused by varying the number of documents used for feedback. The combination of the results for total performance and feedback effect favors the general use of "variable feedback", in which the user searches the retrieval list only until one relevant document is found. In feedback effect, strategies using nonrelevant documents no longer display an average performance superior to strategies using relevant documents only. Significance tests indicate that "relevant only" strategies

are superior for some queries and nonrelevant document strategies for others.

VIII-29

Several areas of investigation are recommended.

At least one more

iteration of variable feedback, and at least two iterations of the "combination strategy" recommended in the earlier total performance study [5] should be investigated using feedback effect evaluation. Significance tests on Rocchio's strategy

the variable feedback results should also be obtained. with normalization should be investigated.

Results of comparing relevant only and nonrelevant document strategies strongly suggest that an investigation be made of subgroups of queries. These types of algorithms seem appropriate to different groups It would be useful to be able to choose the appropriate For

of queries.

strategy for a query before retrieval, by examination of the query.

variable feedback, an investigation of query subgroups should be performed to determine whether or not some identifiable group of users is shortchanged by this strategy. The use of query subgroups, however, raises questions of sample adequacy. Even if the 200 documents and 42 queries of the Cranfield 200

collection are representative of a typical retrieval environment, the statistical dangers of dividing 42 queries into small subgroups must be considered. Investigation of relevance feedback should therefore be The Cranfield 1400 document

continued in a more adequate environment.

collection, available to the SMART system would provide significantly larger samples of documents and queries.

References

[1]

J. J, Rocchio, Relevance Feedback in Information Retrieval, Scientific Report No. ISR-9 to the National Science Foundation, Section III, Harvard Computation Laboratory, August 1965. J. J. Rocchio, Document Retrieval System Optimization and Evaluation, Harvard University Doctoral Thesis, Scientific Report No. ISR-10 to the National Science Foundation, Harvard Computation Laboratory, March 1966. J. J. Rocchio, G. Salton, Search Optimization and Iterative Retrieval Techniques, Proceedings of the Fall Joint Computer Conference, Las Vegas, November 1965. W. Riddle, T. Horwitz, R. Dietz, Relevance Feedback in Information Retrieval Systems, Scientific Report No. ISR-11 to the National Science Foundation, Section VI, Department of Computer Science, Cornell University, June 1966. E. Ide, User Interaction with an Automated Information Retrieval System, Scientific Report No. ISR-12 to the National Science Foundation, Section VIII, Department of Computer Science, Cornell University, June 1967. H. Hall, N. Weiderman, The Evaluation Problem in Relevance Feedback Systems, Scientific Report No. ISR-12 to the National Science Foundation, Section XII, Department of Computer Science, Cornell University, June 1967. G. Salton, The Evaluation of Automatic Retrieval Processes, Selected Test Results using the SMART System, American Documentation, Vol. 16, No. 3, July 1965. F. Wilcoxon, Some Rapid Approximate Statistical Procedures, American Cyanamid Company, Stamford, Connecticut, 1949.

[2]

[3]

[4]

[5]

[6]

[7]

[8]