3. Experiments

3.1 The Test Collections

Experiments were done on three test collections: NPL, Evans and Evans Titles. Their sizes are summarised in the following table.

                              NPL      EVANS    EVANS TITLES
  DOCUMENTS    No.          11429       2542        2542
               MAX            105         57          18
               MIN              1          3           0
               AV.          19.96      16.12        6.51
  QUERIES      No.             93         33          33
               MAX             10         36          35
               MIN              2         12          12
               AV.            7.1       24.5       23.58
  RELEVANCE    MAX             84         53          53
  DOCUMENTS    MIN              1          3           3
               AV.           22.4      24.18       24.18

Notes

The NPL collection is almost identical to the collection used by Robertson, van Rijsbergen and Porter (1981). The only change made to the collection was the removal of terms (a total of 4) from the queries, to make the large queries slightly shorter. More information about the collection can be found in Vaswani and Cameron (1970).

The "Evans" and "Evans Titles" collections are based on the collections used by Sparck-Jones and Webster (1980), but the collections used by us have been modified in the following ways:

(i) A fairly large stop-list (about 300 words) was used to remove common and non-useful words (like "please") from documents and queries.

(ii) Document and query terms were stemmed using Martin Porter's stemming algorithm (Porter 1980).

(iii) The six largest queries were discarded. This was necessary because our program is limited to a maximum query size of 36 terms. This limit was imposed so that we could represent interactions by bit-maps.

The original documents consisted of a title followed by a number of manually chosen keywords and phrases. For the "Evans" collection the title and keywords were merged together and used as the document description. For the "Evans Titles" collection only the title was used. The small difference between the queries of the two collections arises because terms that didn't occur in any document were removed from the queries.
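The query preparation described in (i) above, together with the removal of terms occurring in no document and the 36-term limit of (iii), can be sketched as follows. This is an illustration only: the stop-list shown is a tiny stand-in for the roughly 300-word list actually used, and Porter stemming (Porter 1980) is omitted for brevity.

```python
# Sketch of the collection preprocessing of section 3.1.
# STOP_WORDS is a tiny illustrative stand-in for the ~300-word stop-list.

STOP_WORDS = {"please", "the", "of", "and", "a", "in"}
MAX_QUERY_TERMS = 36  # limit imposed so that interactions fit in a bit-map


def preprocess_query(terms, document_vocabulary):
    """Remove stop words and terms that occur in no document."""
    return [t for t in terms
            if t not in STOP_WORDS and t in document_vocabulary]


def query_usable(terms):
    """Queries of more than 36 terms were discarded (section 3.1 (iii))."""
    return len(terms) <= MAX_QUERY_TERMS


vocab = {"frequency", "analysis", "spectrum"}
q = preprocess_query(["please", "frequency", "analysis", "widget"], vocab)
# "please" is a stop word; "widget" occurs in no document
```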
It should also be noted that our two collections differ from those used by Sparck Jones and Bates in that we used the "need statement" form of the request in both cases, but in one case used the manual indexing as well as titles. Further information about this collection can be found in Evans (1975 a,b).

3.2 The Experiments

We did two different types of feedback experiments: the usual half-collection experiments, and continued-searching experiments in which an initial search is done to obtain feedback information which is then used to complete the search.

3.2.1 Half-collection experiments

These were done in the usual way. First the document collection was divided into two parts, odd and even numbered documents. In some experiments an initial search is done on one half of the collection and a small number (usually 20) of documents are retrieved. The relevant documents in those 20 are then used to recalculate the weights, which are evaluated by doing a full search on the other half of the collection. In the rest of the experiments the relevant documents in the whole of one half of the collection are used to calculate weights, which are then evaluated on the other half of the collection.

3.2.2 Continued searching experiments

These are intended to be an imitation of what would happen if relevance feedback was used in an actual retrieval search. First a few documents are retrieved (5, 10, 15 or 20) using the best weights available without relevance information. Any relevant documents found are then used to recalculate the weights, and the search is continued for the rest of the collection. The two searches that are then compared are those in which

(i) the first few documents are retrieved without relevance information and then the remainder are retrieved using information obtained in the first part of the search;

(ii) the first few documents are retrieved as above but then the search is continued without changing the weights.

Incidentally, there must be an optimum size for the initial search.
If it consists of zero documents there will be no relevance information, and if it consists of the whole collection there will be no continued search. In both cases the two searches (i) and (ii) will be identical.

3.2.3 Evaluation

Our principal method of evaluation was to produce recall-precision graphs. These were obtained using the standard recall cutoff method (van Rijsbergen 1979), and average values of precision were calculated for recall at 10% intervals from 10% to 90%.

The only significance test with any theoretical justification which is applicable to these experiments is the sign test. This is based on the fact that, given the null hypothesis that one sort of query-weight is no better than another, we would expect the number of queries in which the first type of weight does better to have a binomial distribution with probability 0.5. To do this test we need some overall measure of retrieval performance that can be used, query by query, to compare two sets of weights. There are plenty of possibilities, for example we could use precision at some fixed level of recall, but all our tests were done with two measures: normalised recall and normalised precision. Both measures are based on the whole recall-precision graph, but normalised precision is biased to the low recall, high precision end of the graph (van Rijsbergen, 1979).

3.3 Logistic Weights

For the simple linear logistic model, without interactions but with a normal prior, the logistic weights are those which maximise g(w) defined in equation (4) of section 2.3. That is

    g(w) = Σ_{t∈Q} w_t r_t  −  Σ_{T⊆Q} n_T log(1 + exp Σ_{t∈T} w_t)  −  Σ_{t∈Q} (w_t − μ_t)² / 2V

where

    w_t  is the weight associated with term t
    r_t  is the number of known relevant documents containing term t
    Q    is the set of terms being considered, in this case the set of terms in the query
    T    is an arbitrary subset of Q
    n_T  is the number of documents which contain all the terms in T and none of the terms in Q\T
    μ_t, V  are the means and the variance for the assumed normal prior (see 3.3.2).

Newton's method was used to maximise g(w). This is an iterative algorithm which takes the following form:

    w^(k+1) = w^(k) − α_k H(w^(k))⁻¹ b(w^(k))

where b(w) is the vector of first partial derivatives of g, and H(w) is the Hessian, or matrix of second partial derivatives, of g. α_k is a step length, and is chosen to make g(w^(k+1)) > g(w^(k)). As H(w) is negative definite we can always find such an α_k > 0. There are a number of algorithms available for choosing α_k, but we used the following rather crude one:

(1) Set α_k = 1.
(2) If g(w^(k+1)) > g(w^(k)) then use that value of α_k.
(3) Otherwise set α_k = 0.8 α_k and go to (2).

This is certain to find an adequate α_k but it is unlikely to find the best one. In fact α_k will equal 1 for all but the first few iterations.

The iteration was continued until ||w^(k+1) − w^(k)|| < ε, where ε is a fixed small number. The number of iteration steps needed depends mostly on how far w^(0) is from the final value; as we used w^(0) = μ, this is strongly dependent on the size of V. The amount of work that has to be done for each iteration depends mostly on the size of Q, that is on the number of terms in the query. Because of the way our queries were stored we had to limit our query size to a maximum of 36 terms, and for queries of that size convergence took typically 3-4 seconds of CPU time on the Honeywell 6000. For large queries most of the time is probably spent inverting the Hessian, and the algorithm could probably be speeded up by using the same Hessian for several iterations.

3.3.1 The evaluation of b and H

If the formula for g(w) above is differentiated we get the following formulae for b(w) and H(w):

    b_t(w) = r_t − Σ_{T: t∈T} n_T exp(S_T) / (1 + exp(S_T)) − (w_t − μ_t)/V

    H_ts(w) = − Σ_{T: t∈T and s∈T} n_T exp(S_T) / (1 + exp(S_T))² − δ_ts/V

where S_T = Σ_{u∈T} w_u and δ_ts = 1 if t = s, 0 otherwise. A table of the non-zero values of n_T needs to be evaluated and stored just once for each query.

3.3.2 The prior distribution

In the present experiments, we have generally assumed that the Sparck Jones collection frequency weights provide appropriate prior information about the relevance weights. In other words, the prior mean μ_t is taken to be the collection frequency weight according to expression (2) of section 1.2.2. It will be clear that the logistic model is neutral in this respect: any particular formula could be built into the prior. The variance of the prior is interpreted as measuring the reliance that is to be placed on the prior mean, relative to direct evidence from relevance feedback (a low variance implies high reliance on the prior mean). Experiments with different variances are described below.

3.3.3 The inclusion of interactions

It is quite easy to include specific term interactions in a query. For example, the inclusion of a two-term interaction will have exactly the same effect on the model as the addition of an extra term which is contained in all the documents which contain both the terms. Hence we can treat interactions as "pseudo-terms", in much the same way that we include a term t_0 which is assumed to be in every document.

3.4 Results of the experiments

The weighting systems used in the experiments are as follows:

word : unit weights are used (equivalent to level of coordination).

frequency : collection frequency weights are used, according to expression (2) of section 1.2.2.

RSJ : weights are calculated according to the Robertson/Sparck Jones model, "0.5" formula (expression (1) of section 1.2.1), complement method for non-relevant documents.

Logistic V = x : weights are calculated according to the logistic model, normal prior with means equal to the frequency weights, variance x, maximum posterior estimates, complement method for non-relevant documents.
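The fitting procedure of sections 3.3 and 3.3.1 can be sketched as follows. This is a minimal illustration rather than our original program: the function and variable names are invented, the pattern counts n_T are supplied as (T, n_T) pairs with T a set of term indices, and V is the single prior variance.

```python
import numpy as np


def g(w, r, patterns, mu, V):
    """g(w) of section 3.3: grouped log-likelihood plus normal-prior term."""
    val = w @ r - ((w - mu) ** 2).sum() / (2.0 * V)
    for T, nT in patterns:
        val -= nT * np.log1p(np.exp(w[list(T)].sum()))
    return val


def grad_hess(w, r, patterns, mu, V):
    """First derivatives b(w) and Hessian H(w) of section 3.3.1."""
    b = r - (w - mu) / V
    H = np.diag(np.full(len(w), -1.0 / V))
    for T, nT in patterns:
        idx = list(T)
        p = 1.0 / (1.0 + np.exp(-w[idx].sum()))  # exp(S_T) / (1 + exp(S_T))
        b[idx] -= nT * p
        H[np.ix_(idx, idx)] -= nT * p * (1.0 - p)
    return b, H


def fit_logistic_weights(r, patterns, mu, V, eps=1e-6, max_iter=50):
    """Newton's method with the crude 0.8-backoff step length of section 3.3."""
    r, mu = np.asarray(r, float), np.asarray(mu, float)
    w = mu.copy()                        # w(0) = mu, the prior mean
    for _ in range(max_iter):
        b, H = grad_hess(w, r, patterns, mu, V)
        step = np.linalg.solve(H, b)     # Newton direction; H is negative definite
        alpha, g_old = 1.0, g(w, r, patterns, mu, V)
        while g(w - alpha * step, r, patterns, mu, V) <= g_old and alpha > 1e-8:
            alpha *= 0.8                 # shrink the step length until g improves
        w_new = w - alpha * step
        if np.linalg.norm(w_new - w) < eps:
            return w_new
        w = w_new
    return w


# Example: two query terms, zero prior mean, variance 10
r = np.array([3.0, 1.0])                 # relevant documents containing each term
patterns = [({0}, 5), ({1}, 4), ({0, 1}, 2)]
w = fit_logistic_weights(r, patterns, mu=np.zeros(2), V=10.0)
```

At the returned point the gradient b(w) is essentially zero, i.e. w is the maximum-posterior estimate for this toy query.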
3.4.1 Continued searching experiments

In all of these the initial search (retrieving 5, 10, 15 or 20 documents) is done using the frequency weights. Then the search is continued using weights obtained from feedback information obtained in the initial search. For each experiment we list, in order,

(i) the length of the initial search;

(ii) the precision values for recall values of 10%, 20%, ..., 90%;

(iii) significance levels for the sign test of the null hypothesis that feedback produces no improvement. Two values are given, using normalised recall and normalised precision respectively to compare performance. Both values are given as a percentage. Hence if both (or either) value is less than 1, say, we would reject the null hypothesis at the 1% level.

The NPL Collection

Search without feedback (frequency weights):
                   54 45 37 31 24 18 15 11  6

RSJ weights
  initial search   precision at 10%, ..., 90% recall    sig.(NR)  sig.(NP)
   5               56 47 40 32 26 20 15 11  7             0.2       0.2
  10               56 47 40 32 26 19 15 11  7             3         3
  15               55 47 40 33 26 20 16 12  7             0.2       0.2
  20               55 47 39 32 26 20 16 12  8             0.6       0.6

Logistic weights, V = 1
   5               55 45 38 31 25 18 15 11  6            86        27
  10               55 46 39 32 25 19 15 11  7            24         5
  15               55 46 38 32 26 19 15 11  7            30         2
  20               55 45 38 32 26 19 15 11  6            50         7

The EVANS Collection

Search without feedback (frequency weights):
                   62 48 39 33 27 22 13  –  –

RSJ weights
   1               62 51 41 35 29 22 14  9  6             5         3
   5               65 52 45 38 32 26 15 11  8             0.2       0.2
  10               63 51 44 38 33 26 16 12  7             0.3       0.2
  15               63 50 43 38 33 26 16 12  7             0.02      0.1
  20               63 50 42 37 32 26 17 13  8             0.01      0.01

Logistic weights
   5               62 50 40 33 28 22 14  9  5            50       100
  10               63 49 41 35 29 23 15 10  6             3        12
  15               63 49 41 36 29 23 15 10  6             0.6       6.2
  20               63 49 42 35 30 23 15 10  6             0.03      1.6

The EVANS TITLES Collection

Search without feedback (frequency weights):
                   48 35 23 16 11  7  5  4  2
RSJ weights
  initial search   precision at 10%, ..., 90% recall    sig.(NR)  sig.(NP)
   5               50 37 27 18 12  8  6  4  2            50        14
  10               49 37 27 19 12  8  6  4  2            34        36
  15               48 37 26 18 12  8  6  4  2            34        50
  20               49 36 27 18 13  8  6  4  2            50        14

Logistic weights, V = 1
   5               48 36 23 15 11  7  5  4  2            75        75
  10               48 36 24 16 11  7  5  4  4            75        75
  15               48 35 25 16 11  7  5  4  2            50        50
  20               48 35 25 16 11  7  5  4  2            50        75

3.4.2 Half-collection experiments

We have grouped these according to the half-collection used to evaluate the weights. For each experiment we give the source of the weights and then the average precision values for recall at 10% intervals from 10% to 90%. When giving the source of the weights the following abbreviations are used.

even, odd : all the known relevant documents in the half collection (either even or odd numbered documents) are used to calculate the term weights.

even/20, odd/20 : a preliminary search is done on the specified half collection using unit weights, and the relevant documents found among the first 20 retrieved documents are used to calculate the weights.

The NPL Collection, even numbered documents
  word                        51 43 36 28 24 20 14 11  7
  frequency                   57 48 39 31 26 23 17 14 10
  RSJ, odd                    62 56 49 39 34 29 21 17 12
  logistic V = 4, odd         66 57 48 40 34 29 22 18 12
  RSJ, odd/20                 60 52 43 35 30 25 17 14 10
  logistic V = 4, odd/20      61 52 42 34 29 25 19 14 10

NPL, odd numbered documents
  word                        49 39 29 24 22 16 12  8  5
  frequency                   57 48 39 33 28 21 17 12  8
  RSJ, even                   65 56 46 38 34 26 21 15 10
  logistic V = 4, even        66 58 47 41 37 29 22 16 13
  RSJ, even/20                59 50 40 34 29 22 18 12  9
  logistic V = 4, even/20     61 51 42 36 31 24 18  8  7

EVANS, even numbered documents
  word                        55 46 35 32 27 16 14 11  8
  frequency                   58 53 39 35 32 21 17 13  9
  RSJ, odd                    74 69 63 54 46 36 26 21 14
  logistic V = 1, odd         64 57 47 45 39 28 22 15 12
  logistic V = 4, odd         67 64 54 49 44 30 23 16 13
  logistic V = 10, odd        70 68 58 50 44 31 23 18 14
  logistic V = 20, odd        73 69 60 51 44 30 22 16 12
  RSJ, odd/20                 69 64 56 48 38 27 19 15 11
  logistic V = 1, odd/20      61 54 43 38 33 21 17 13 10
  logistic V = 4, odd/20      62 57 46 40 35 23 19 14 11
EVANS, odd numbered documents
  word                        63 51 40 35 30 21 14 10  6
  frequency                   70 60 47 42 35 24 15 12  8
  RSJ, even                   72 66 57 53 44 37 26 20 10
  logistic V = 1, even        69 60 52 49 42 30 19 15  8
  logistic V = 4, even        69 61 52 48 41 32 21 16  8
  logistic V = 9, even        68 59 52 46 40 32 21 17  9
  logistic V = 25, even       69 58 51 46 39 31 22 17  9
  RSJ, even/20                68 61 51 45 35 29 20 17 10
  logistic V = 1, even/20     69 61 48 44 37 27 17 14  9
  logistic V = 4, even/20     66 60 47 43 36 28 16 14  7

EVANS-TITLES, even numbered documents
  word                        42 34 27 21 16 11  9  7  6
  frequency                   50 35 27 22 20 13 10  7  6
  RSJ, odd                    58 49 38 27 24 15 12  8  7
  logistic V = 1, odd         54 39 31 24 22 14 11  8  6
  logistic V = 4, odd         57 45 32 26 23 16 12  8  7
  RSJ, odd/20                 54 45 33 26 23 15 12  8  7
  logistic V = 1, odd/20      54 39 30 24 22 14 11  8  7
  logistic V = 4, odd/20      55 44 32 24 22 15 12  8  7

EVANS-TITLES, odd numbered documents
  word                        50 35 22 18 12  8  5  3  2
  frequency                   54 43 27 20 14  8  4  3  2
  RSJ, even                   65 58 43 30 22 14  7  5  3
  logistic V = 1, even        56 50 34 23 15 11  6  5  3
  logistic V = 4, even        58 48 36 26 17 12  6  4  3
  RSJ, even/20                59 52 33 25 19 12  5  5  3
  logistic V = 1, even/20     54 43 27 20 13  9  5  5  3
  logistic V = 4, even/20     54 44 29 20 14  9  5  4  3

3.5 Survey of the results

3.5.1 Continued searching experiments

This form of evaluation, it should be pointed out, is a new one, proposed as an alternative to the traditional "residual ranking" experiments. In the present experiments, the documents used for feedback are retained in the evaluation set. We suggest that this procedure is more realistic (i.e. simulates a real-life search better) than residual ranking. As might be expected, the continued searching experiments show smaller performance differences between feedback and no feedback than residual ranking. Feedback always improves performance a little, but often not significantly. RSJ weights always perform better than logistic weights with the same feedback set, though again often not significantly. A detailed examination of the
figures suggests that (a) the larger the feedback set, the better the performance at the tail (high recall) end of the curve; (b) a small feedback set seems likely to produce a substantial performance improvement in just a few queries, while a larger feedback set gives slighter improvements spread over a larger number of queries (particularly evident in Evans). Some of these points are illustrated in Fig. 1, which gives some results from the Evans collection.

3.5.2 Half-collection experiments

It has proved very difficult to generalize from these results. Out of the six sets of experiments, one can find examples where RSJ outperforms logistic and examples of the opposite; examples where higher-variance priors for the logistic improve performance and examples where they depress it. However, in general the results show no strong evidence for logistic weights.

Figure 1. Results from the Evans collection.

3.5.3 Other experiments

Some further experiments were performed early in the project, as a basis for further work. The results of these experiments are not presented in detail here; however, it is worth indicating some of the main findings.

Some experiments were done on the inclusion of interactions: selected pairs of query terms were included as indicated in section 3.3.3. The pairs were selected by inspection, as likely to have meaning as a phrase which is not apparent in the separate words (e.g. "high frequency"). All these experiments were negative, in the sense that including interactions depressed performance; the more interactions were included, the more performance was depressed. This somewhat strange result led us to attempt a theoretical answer to the question: when should an interaction be included? The resulting theoretical development is presented and discussed in section 4.
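The pseudo-term device used in these interaction experiments (section 3.3.3) can be sketched as follows: a two-term interaction behaves exactly like an extra term contained in precisely those documents that contain both members of the pair. Representing documents as sets of term strings, and the pseudo-term label itself, are assumptions of this illustration.

```python
# Treat a two-term interaction as a "pseudo-term" (section 3.3.3).

def add_pseudo_term(documents, pair):
    """Return new document term-sets with the interaction pseudo-term added
    to every document containing both members of the pair."""
    a, b = pair
    pseudo = ("INTERACTION", a, b)   # hypothetical label for the pseudo-term
    return [doc | {pseudo} if a in doc and b in doc else set(doc)
            for doc in documents]


docs = [{"high", "frequency", "noise"}, {"high", "voltage"}, {"frequency"}]
expanded = add_pseudo_term(docs, ("high", "frequency"))
# only the first document contains both "high" and "frequency"
```

The expanded documents can then be weighted exactly as before, with the pseudo-term treated as one more query term.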
Some experiments were done with the prior for the logistic model having zero mean (rather than collection-frequency based) and large variance - i.e. as neutral a prior distribution as possible. These performed worse than the frequency-based mean. No experiments, however, have been done using other forms of prior, such as constant (non-zero) mean or a peaked function such as Salton's.

We attempted to get some direct evidence on the possible ideal form of frequency-based prior. We plotted, for individual query terms, the neutral-prior retrospective logistic weight against the collection-frequency weight (Fig. 2). This scatter diagram is notable only for showing virtually no pattern - there is a very slight positive correlation, but certainly no possibility of distinguishing between alternative relationships. We considered the possibility that this lack of pattern was due to variations between queries: that is, that collection-frequency weights might be a good predictor of the relative values of different terms in a given query, but not of their values in relation to terms from other queries. So we normalized the weights for each query. The result is given in Fig. 3; the correlation is a little higher, but still not really good enough to draw any useful conclusions. This area is discussed further in section 5.

Figure 2.
j" X X X X X X X X X X X X X MX XX CM X X X X X X fM X fM X X X X f\J X rO >» X X I X X X X X X X K» X X X X X X M M K> X X X X X U x X X I x : X X :x x X fM X X X X X fM X CM X X fM K) M X X K XIMMK X • X rvi O I x x XXX4 X X X X X X fM X X f\j X X fM X X fM X X X fM X X fM X X X r\j X X •sf X o fM •*» X X X X t x x x fM x x fM X X xx X 5^ X X X X o >* • o V fM • M fM fM O 5 • OJ I — «•M o >» f> m 4, • i. <* <• o Wf> ¥• r- K> • 00 » I ~ " 1 Ft Vf