3. Experiments

3.1 The Test Collections

Experiments were done on three test collections: NPL, Evans and Evans Titles. Their sizes are summarised in the following table.

                              NPL      EVANS    EVANS TITLES
  DOCUMENTS    No.          11429       2542        2542
               MAX            105         57          18
               MIN              1          3           0
               AV.          19.96      16.12        6.51
  QUERIES      No.             93         33          33
               MAX             10         36          35
               MIN              2         12          12
               AV.            7.1       24.5       23.58
  RELEVANCE    MAX             84         53          53
  DOCUMENTS    MIN              1          3           3
               AV.           22.4      24.18       24.18

Notes

The NPL collection is almost identical to the collection used by Robertson, van Rijsbergen and Porter (1981). The only change made to the collection was the removal of terms (a total of 4) from the queries, to make the large queries slightly shorter. More information about the collection can be found in Vaswani and Cameron (1970).

The "Evans" and "Evans Titles" collections are based on the collections used by Sparck-Jones and Webster (1980), but the collections used by us have been modified in the following ways:

(i) A fairly large stop-list (about 300 words) was used to remove common and non-useful words (like "please") from documents and queries.

(ii) Document and query terms were stemmed using Martin Porter's stemming algorithm (Porter 1980).

(iii) The six largest queries were discarded. This was necessary because our program is limited to a maximum query size of 36 terms. This limit was imposed so that we could represent interactions by bit-maps.

The original documents consisted of a title followed by a number of manually chosen keywords and phrases. For the "Evans" collection the title and keywords were merged together and used as the document description. For the "Evans Titles" collection only the title was used. The small difference between the queries of the two collections arises because terms that didn't occur in any document were removed from the queries.
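The query preparation described in (i) above, together with the removal of terms occurring in no document and the 36-term limit of (iii), can be sketched as follows. This is an illustration only: the stop-list shown is a tiny stand-in for the roughly 300-word list actually used, and Porter stemming (Porter 1980) is omitted for brevity.

```python
# Sketch of the collection preprocessing of section 3.1.
# STOP_WORDS is a tiny illustrative stand-in for the ~300-word stop-list.

STOP_WORDS = {"please", "the", "of", "and", "a", "in"}
MAX_QUERY_TERMS = 36  # limit imposed so that interactions fit in a bit-map


def preprocess_query(terms, document_vocabulary):
    """Remove stop words and terms that occur in no document."""
    return [t for t in terms
            if t not in STOP_WORDS and t in document_vocabulary]


def query_usable(terms):
    """Queries of more than 36 terms were discarded (section 3.1 (iii))."""
    return len(terms) <= MAX_QUERY_TERMS


vocab = {"frequency", "analysis", "spectrum"}
q = preprocess_query(["please", "frequency", "analysis", "widget"], vocab)
# "please" is a stop word; "widget" occurs in no document
```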
It should also be noted that our two collections differ from those used by Sparck Jones and Bates in that we used the "need statement" form of the request in both cases, but in one case used the manual indexing as well as titles. Further information about this collection can be found in Evans (1975 a,b).

3.2 The Experiments

We did two different types of feedback experiments: the usual half-collection experiments, and continued-searching experiments in which an initial search is done to obtain feedback information which is then used to complete the search.

3.2.1 Half-collection experiments

These were done in the usual way. First the document collection was divided into two parts, odd and even numbered documents. In some experiments an initial search is done on one half of the collection and a small number (usually 20) of documents are retrieved. The relevant documents in those 20 are then used to recalculate the weights, which are evaluated by doing a full search on the other half of the collection. In the rest of the experiments the relevant documents in the whole of one half of the collection are used to calculate weights, which are then evaluated on the other half of the collection.

3.2.2 Continued searching experiments

These are intended to be an imitation of what would happen if relevance feedback was used in an actual retrieval search. First a few documents are retrieved (5, 10, 15 or 20) using the best weights available without relevance information. Any relevant documents found are then used to recalculate the weights, and the search is continued for the rest of the collection. The two searches that are then compared are those in which

(i) the first few documents are retrieved without relevance information and then the remainder are retrieved using information obtained in the first part of the search;

(ii) the first few documents are retrieved as above but then the search is continued without changing the weights.

Incidentally, there must be an optimum size for the initial search.
If it consists of zero documents there will be no relevance information, and if it consists of the whole collection there will be no continued search. In both cases the two searches (i) and (ii) will be identical.

3.2.3 Evaluation

Our principal method of evaluation was to produce recall-precision graphs. These were obtained using the standard recall cutoff method (van Rijsbergen 1979), and average values of precision were calculated for recall at 10% intervals from 10% to 90%.

The only significance test with any theoretical justification which is applicable to these experiments is the sign test. This is based on the fact that, given the null hypothesis that one sort of query-weight is no better than another, we would expect the number of queries in which the first type of weight does better to have a binomial distribution with probability 0.5. To do this test we need some overall measure of retrieval performance that can be used, query by query, to compare two sets of weights. There are plenty of possibilities, for example we could use precision at some fixed level of recall, but all our tests were done with two measures: normalised recall and normalised precision. Both measures are based on the whole recall-precision graph, but normalised precision is biased to the low recall, high precision end of the graph (van Rijsbergen, 1979).

3.3 Logistic Weights

For the simple linear logistic model, without interactions but with a normal prior, the logistic weights are those which maximise g(w) defined in equation (4) of section 2.3. That is

    g(w) = Σ_{t∈Q} w_t r_t  −  Σ_{T⊆Q} n_T log(1 + exp Σ_{t∈T} w_t)  −  Σ_{t∈Q} (w_t − μ_t)² / 2V

where

    w_t  is the weight associated with term t
    r_t  is the number of known relevant documents containing term t
    Q    is the set of terms being considered, in this case the set of terms in the query
    T    is an arbitrary subset of Q
    n_T  is the number of documents which contain all the terms in T and none of the terms in Q\T
    μ_t, V  are the means and the variance for the assumed normal prior (see 3.3.2).

Newton's method was used to maximise g(w). This is an iterative algorithm which takes the following form:

    w^(k+1) = w^(k) − α_k H(w^(k))⁻¹ b(w^(k))

where b(w) is the vector of first partial derivatives of g, and H(w) is the Hessian, or matrix of second partial derivatives, of g. α_k is a step length, and is chosen to make g(w^(k+1)) > g(w^(k)). As H(w) is negative definite we can always find such an α_k > 0. There are a number of algorithms available for choosing α_k, but we used the following rather crude one:

(1) Set α_k = 1.
(2) If g(w^(k+1)) > g(w^(k)) then use that value of α_k.
(3) Otherwise set α_k = 0.8 α_k and go to (2).

This is certain to find an adequate α_k but it is unlikely to find the best one. In fact α_k will equal 1 for all but the first few iterations.

The iteration was continued until ||w^(k+1) − w^(k)|| < ε, where ε is a fixed small number. The number of iteration steps needed depends mostly on how far w^(0) is from the final value; as we used w^(0) = μ, this is strongly dependent on the size of V. The amount of work that has to be done for each iteration depends mostly on the size of Q, that is on the number of terms in the query. Because of the way our queries were stored we had to limit our query size to a maximum of 36 terms, and for queries of that size convergence took typically 3-4 seconds of CPU time on the Honeywell 6000. For large queries most of the time is probably spent inverting the Hessian, and the algorithm could probably be speeded up by using the same Hessian for several iterations.

3.3.1 The evaluation of b and H

If the formula for g(w) above is differentiated we get the following formulae for b(w) and H(w):

    b_t(w) = r_t − Σ_{T: t∈T} n_T exp(S_T) / (1 + exp(S_T)) − (w_t − μ_t)/V

    H_ts(w) = − Σ_{T: t∈T and s∈T} n_T exp(S_T) / (1 + exp(S_T))² − δ_ts/V

where S_T = Σ_{u∈T} w_u and δ_ts = 1 if t = s, 0 otherwise. A table of the non-zero values of n_T needs to be evaluated and stored just once for each query.

3.3.2 The prior distribution

In the present experiments, we have generally assumed that the Sparck Jones collection frequency weights provide appropriate prior information about the relevance weights. In other words, the prior mean μ_t is taken to be the collection frequency weight according to expression (2) of section 1.2.2. It will be clear that the logistic model is neutral in this respect: any particular formula could be built into the prior. The variance of the prior is interpreted as measuring the reliance that is to be placed on the prior mean, relative to direct evidence from relevance feedback (a low variance implies high reliance on the prior mean). Experiments with different variances are described below.

3.3.3 The inclusion of interactions

It is quite easy to include specific term interactions in a query. For example, the inclusion of a two-term interaction will have exactly the same effect on the model as the addition of an extra term which is contained in all the documents which contain both the terms. Hence we can treat interactions as "pseudo-terms", in much the same way that we include a term t_0 which is assumed to be in every document.

3.4 Results of the experiments

The weighting systems used in the experiments are as follows:

word : unit weights are used (equivalent to level of coordination).

frequency : collection frequency weights are used, according to expression (2) of section 1.2.2.

RSJ : weights are calculated according to the Robertson/Sparck Jones model, "0.5" formula (expression (1) of section 1.2.1), complement method for non-relevant documents.

Logistic V = x : weights are calculated according to the logistic model, normal prior with means equal to the frequency weights, variance x, maximum posterior estimates, complement method for non-relevant documents.
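The fitting procedure of sections 3.3 and 3.3.1 can be sketched as follows. This is a minimal illustration rather than our original program: the function and variable names are invented, the pattern counts n_T are supplied as (T, n_T) pairs with T a set of term indices, and V is the single prior variance.

```python
import numpy as np


def g(w, r, patterns, mu, V):
    """g(w) of section 3.3: grouped log-likelihood plus normal-prior term."""
    val = w @ r - ((w - mu) ** 2).sum() / (2.0 * V)
    for T, nT in patterns:
        val -= nT * np.log1p(np.exp(w[list(T)].sum()))
    return val


def grad_hess(w, r, patterns, mu, V):
    """First derivatives b(w) and Hessian H(w) of section 3.3.1."""
    b = r - (w - mu) / V
    H = np.diag(np.full(len(w), -1.0 / V))
    for T, nT in patterns:
        idx = list(T)
        p = 1.0 / (1.0 + np.exp(-w[idx].sum()))  # exp(S_T) / (1 + exp(S_T))
        b[idx] -= nT * p
        H[np.ix_(idx, idx)] -= nT * p * (1.0 - p)
    return b, H


def fit_logistic_weights(r, patterns, mu, V, eps=1e-6, max_iter=50):
    """Newton's method with the crude 0.8-backoff step length of section 3.3."""
    r, mu = np.asarray(r, float), np.asarray(mu, float)
    w = mu.copy()                        # w(0) = mu, the prior mean
    for _ in range(max_iter):
        b, H = grad_hess(w, r, patterns, mu, V)
        step = np.linalg.solve(H, b)     # Newton direction; H is negative definite
        alpha, g_old = 1.0, g(w, r, patterns, mu, V)
        while g(w - alpha * step, r, patterns, mu, V) <= g_old and alpha > 1e-8:
            alpha *= 0.8                 # shrink the step length until g improves
        w_new = w - alpha * step
        if np.linalg.norm(w_new - w) < eps:
            return w_new
        w = w_new
    return w


# Example: two query terms, zero prior mean, variance 10
r = np.array([3.0, 1.0])                 # relevant documents containing each term
patterns = [({0}, 5), ({1}, 4), ({0, 1}, 2)]
w = fit_logistic_weights(r, patterns, mu=np.zeros(2), V=10.0)
```

At the returned point the gradient b(w) is essentially zero, i.e. w is the maximum-posterior estimate for this toy query.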
3.4.1 Continued searching experiments

In all of these the initial search (retrieving 5, 10, 15 or 20 documents) is done using the frequency weights. Then the search is continued using weights obtained from feedback information obtained in the initial search. For each experiment we list, in order,

(i) the length of the initial search;

(ii) the precision values for recall values of 10%, 20%, ..., 90%;

(iii) significance levels for the sign test of the null hypothesis that feedback produces no improvement. Two values are given, using normalised recall and normalised precision respectively to compare performance. Both values are given as a percentage. Hence if both (or either) value is less than 1, say, we would reject the null hypothesis at the 1% level.

The NPL Collection

Search without feedback (frequency weights):
                   54 45 37 31 24 18 15 11  6

RSJ weights
  initial search   precision at 10%, ..., 90% recall    sig.(NR)  sig.(NP)
   5               56 47 40 32 26 20 15 11  7             0.2       0.2
  10               56 47 40 32 26 19 15 11  7             3         3
  15               55 47 40 33 26 20 16 12  7             0.2       0.2
  20               55 47 39 32 26 20 16 12  8             0.6       0.6

Logistic weights, V = 1
   5               55 45 38 31 25 18 15 11  6            86        27
  10               55 46 39 32 25 19 15 11  7            24         5
  15               55 46 38 32 26 19 15 11  7            30         2
  20               55 45 38 32 26 19 15 11  6            50         7

The EVANS Collection

Search without feedback (frequency weights):
                   62 48 39 33 27 22 13  –  –

RSJ weights
   1               62 51 41 35 29 22 14  9  6             5         3
   5               65 52 45 38 32 26 15 11  8             0.2       0.2
  10               63 51 44 38 33 26 16 12  7             0.3       0.2
  15               63 50 43 38 33 26 16 12  7             0.02      0.1
  20               63 50 42 37 32 26 17 13  8             0.01      0.01

Logistic weights
   5               62 50 40 33 28 22 14  9  5            50       100
  10               63 49 41 35 29 23 15 10  6             3        12
  15               63 49 41 36 29 23 15 10  6             0.6       6.2
  20               63 49 42 35 30 23 15 10  6             0.03      1.6

The EVANS TITLES Collection

Search without feedback (frequency weights):
                   48 35 23 16 11  7  5  4  2
RSJ weights
  initial search   precision at 10%, ..., 90% recall    sig.(NR)  sig.(NP)
   5               50 37 27 18 12  8  6  4  2            50        14
  10               49 37 27 19 12  8  6  4  2            34        36
  15               48 37 26 18 12  8  6  4  2            34        50
  20               49 36 27 18 13  8  6  4  2            50        14

Logistic weights, V = 1
   5               48 36 23 15 11  7  5  4  2            75        75
  10               48 36 24 16 11  7  5  4  4            75        75
  15               48 35 25 16 11  7  5  4  2            50        50
  20               48 35 25 16 11  7  5  4  2            50        75

3.4.2 Half-collection experiments

We have grouped these according to the half-collection used to evaluate the weights. For each experiment we give the source of the weights and then the average precision values for recall at 10% intervals from 10% to 90%. When giving the source of the weights the following abbreviations are used.

even, odd : all the known relevant documents in the half collection (either even or odd numbered documents) are used to calculate the term weights.

even/20, odd/20 : a preliminary search is done on the specified half collection using unit weights, and the relevant documents found among the first 20 retrieved documents are used to calculate the weights.

The NPL Collection, even numbered documents
  word                        51 43 36 28 24 20 14 11  7
  frequency                   57 48 39 31 26 23 17 14 10
  RSJ, odd                    62 56 49 39 34 29 21 17 12
  logistic V = 4, odd         66 57 48 40 34 29 22 18 12
  RSJ, odd/20                 60 52 43 35 30 25 17 14 10
  logistic V = 4, odd/20      61 52 42 34 29 25 19 14 10

NPL, odd numbered documents
  word                        49 39 29 24 22 16 12  8  5
  frequency                   57 48 39 33 28 21 17 12  8
  RSJ, even                   65 56 46 38 34 26 21 15 10
  logistic V = 4, even        66 58 47 41 37 29 22 16 13
  RSJ, even/20                59 50 40 34 29 22 18 12  9
  logistic V = 4, even/20     61 51 42 36 31 24 18  8  7

EVANS, even numbered documents
  word                        55 46 35 32 27 16 14 11  8
  frequency                   58 53 39 35 32 21 17 13  9
  RSJ, odd                    74 69 63 54 46 36 26 21 14
  logistic V = 1, odd         64 57 47 45 39 28 22 15 12
  logistic V = 4, odd         67 64 54 49 44 30 23 16 13
  logistic V = 10, odd        70 68 58 50 44 31 23 18 14
  logistic V = 20, odd        73 69 60 51 44 30 22 16 12
  RSJ, odd/20                 69 64 56 48 38 27 19 15 11
  logistic V = 1, odd/20      61 54 43 38 33 21 17 13 10
  logistic V = 4, odd/20      62 57 46 40 35 23 19 14 11
EVANS, odd numbered documents
  word                        63 51 40 35 30 21 14 10  6
  frequency                   70 60 47 42 35 24 15 12  8
  RSJ, even                   72 66 57 53 44 37 26 20 10
  logistic V = 1, even        69 60 52 49 42 30 19 15  8
  logistic V = 4, even        69 61 52 48 41 32 21 16  8
  logistic V = 9, even        68 59 52 46 40 32 21 17  9
  logistic V = 25, even       69 58 51 46 39 31 22 17  9
  RSJ, even/20                68 61 51 45 35 29 20 17 10
  logistic V = 1, even/20     69 61 48 44 37 27 17 14  9
  logistic V = 4, even/20     66 60 47 43 36 28 16 14  7

EVANS-TITLES, even numbered documents
  word                        42 34 27 21 16 11  9  7  6
  frequency                   50 35 27 22 20 13 10  7  6
  RSJ, odd                    58 49 38 27 24 15 12  8  7
  logistic V = 1, odd         54 39 31 24 22 14 11  8  6
  logistic V = 4, odd         57 45 32 26 23 16 12  8  7
  RSJ, odd/20                 54 45 33 26 23 15 12  8  7
  logistic V = 1, odd/20      54 39 30 24 22 14 11  8  7
  logistic V = 4, odd/20      55 44 32 24 22 15 12  8  7

EVANS-TITLES, odd numbered documents
  word                        50 35 22 18 12  8  5  3  2
  frequency                   54 43 27 20 14  8  4  3  2
  RSJ, even                   65 58 43 30 22 14  7  5  3
  logistic V = 1, even        56 50 34 23 15 11  6  5  3
  logistic V = 4, even        58 48 36 26 17 12  6  4  3
  RSJ, even/20                59 52 33 25 19 12  5  5  3
  logistic V = 1, even/20     54 43 27 20 13  9  5  5  3
  logistic V = 4, even/20     54 44 29 20 14  9  5  4  3

3.5 Survey of the results

3.5.1 Continued searching experiments

This form of evaluation, it should be pointed out, is a new one, proposed as an alternative to the traditional "residual ranking" experiments. In the present experiments, the documents used for feedback are retained in the evaluation set. We suggest that this procedure is more realistic (i.e. simulates a real-life search better) than residual ranking. As might be expected, the continued searching experiments show smaller performance differences between feedback and no feedback than residual ranking. Feedback always improves performance a little, but often not significantly. RSJ weights always perform better than logistic weights with the same feedback set, though again often not significantly. A detailed examination of the
figures suggests that (a) the larger the feedback set, the better the performance at the tail (high recall) end of the curve; (b) a small feedback set seems likely to produce a substantial performance improvement in just a few queries, while a larger feedback set gives slighter improvements spread over a larger number of queries (particularly evident in Evans). Some of these points are illustrated in Fig. 1, which gives some results from the Evans collection.

3.5.2 Half-collection experiments

It has proved very difficult to generalize from these results. Out of the six sets of experiments, one can find examples where RSJ outperforms logistic and examples of the opposite; examples where higher-variance priors for the logistic improve performance and examples where they depress it. However, in general the results show no strong evidence for logistic weights.

Figure 1. Results from the Evans collection.

3.5.3 Other experiments

Some further experiments were performed early in the project, as a basis for further work. The results of these experiments are not presented in detail here; however, it is worth indicating some of the main findings.

Some experiments were done on the inclusion of interactions: selected pairs of query terms were included as indicated in section 3.3.3. The pairs were selected by inspection, as likely to have meaning as a phrase which is not apparent in the separate words (e.g. "high frequency"). All these experiments were negative, in the sense that including interactions depressed performance; the more interactions were included, the more performance was depressed. This somewhat strange result led us to attempt a theoretical answer to the question: when should an interaction be included? The resulting theoretical development is presented and discussed in section 4.
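The pseudo-term device used in these interaction experiments (section 3.3.3) can be sketched as follows: a two-term interaction behaves exactly like an extra term contained in precisely those documents that contain both members of the pair. Representing documents as sets of term strings, and the pseudo-term label itself, are assumptions of this illustration.

```python
# Treat a two-term interaction as a "pseudo-term" (section 3.3.3).

def add_pseudo_term(documents, pair):
    """Return new document term-sets with the interaction pseudo-term added
    to every document containing both members of the pair."""
    a, b = pair
    pseudo = ("INTERACTION", a, b)   # hypothetical label for the pseudo-term
    return [doc | {pseudo} if a in doc and b in doc else set(doc)
            for doc in documents]


docs = [{"high", "frequency", "noise"}, {"high", "voltage"}, {"frequency"}]
expanded = add_pseudo_term(docs, ("high", "frequency"))
# only the first document contains both "high" and "frequency"
```

The expanded documents can then be weighted exactly as before, with the pseudo-term treated as one more query term.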
Some experiments were done with the prior for the logistic model having zero mean (rather than collection-frequency based) and large variance - i.e. as neutral a prior distribution as possible. These performed worse than the frequency-based mean. No experiments, however, have been done using other forms of prior, such as constant (non-zero) mean or a peaked function such as Salton's.

We attempted to get some direct evidence on the possible ideal form of frequency-based prior. We plotted, for individual query terms, the neutral-prior retrospective logistic weight against the collection-frequency weight (Fig. 2). This scatter diagram is notable only for showing virtually no pattern - there is a very slight positive correlation, but certainly no possibility of distinguishing between alternative relationships. We considered the possibility that this lack of pattern was due to variations between queries: that is, that collection-frequency weights might be a good predictor of the relative values of different terms in a given query, but not of their values in relation to terms from other queries. So we normalized the weights for each query. The result is given in Fig. 3; the correlation is a little higher, but still not really good enough to draw any useful conclusions. This area is discussed further in section 5.

Figure 2.
j" X X X X X X X X X X X X X MX XX CM X X X X X X fM X fM X X X X f\J X rO >» X X I X X X X X X X K» X X X X X X M M K> X X X X X U x X X I x : X X :x x X fM X X X X X fM X CM X X fM K) M X X K XIMMK X • X rvi O I x x XXX4 X X X X X X fM X X f\j X X fM X X fM X X X fM X X fM X X X r\j X X •sf X o fM •*» X X X X t x x x fM x x fM X X xx X 5^ X X X X o >* • o V fM • M fM fM O 5 • OJ I — «•M o >» f> m 4, • i. <* <• o Wf> ¥• r- K> • 00 » I ~ " 1 Ft Vf