9. 2. Logistic Models 2.1 General description The object of a probabilistic retrieval model is to provide some method of estimating the probability that a particular document is relevant. The information which is to be used to obtain this estimate is essentially the specific combination of index terms assigned to (or present in) the particular document. Thus from the point of view of the probabilistic model, any two documents with exactly the same set of index term assignments are indistinguishable. The index terms in effect serve to allocate documents to exclusive classes or cells, each all being defined by a particular combination of terms. is the probability j will be relevant. Pj The probability of relevance selected document in cell that a randomly A "pure" probabilistic approach would require past evidence about the relevance of documents from any particular cell, in order to make statements about new documents in that cell. But since most cells are empty and most of the rest contain only one or two documents, this is not in general possible in information retrieval. usual So the procedure has been to devise a model on the basis of some imply Assuming assumptions of independence between index terms, which relationships between the various probabilities pj . such a model means that the estimate of p • for cell j may be based on evidence about the relevance of documents from other cells. For example, in the Robertson/Sparck Jones model, strong independence assumptions are made; as a consequence, the estimate of pj uses evidence from all previously-judged documents containing any one of the terms that define cell j. The model or class of models proposed here actually reverses part of this process. Instead of making independence assumptions and pn-fs, d deriving relationships between the about these relationships. it makes direct assumpumptions 10. Tb© relationships are expressed in terms of the logistic transform of the probabilities, so the class of models might generally be described as "logistic". One particular model in the class, the "independent logistic" model, is roughly equivalent to the Robertson/ Spark Jones model. Dependencies between terms can also be introduced. Further, the framework includes a natural range of estimation methods, as will be seen. 2.2 Mathematical Description 2,2#1 Notation and likelihood function We assume that there are of documents. in class the nfj j, d distinguishable classes (or cells) We also assume that in unit time, the number of documents nj, has a Poisson distribution with mean cj, and that particular are independent. ni It will be seen below that the is not important. assumption about the We also assume that the probability of a document being relevant given that it is in class j is Pj . The number of relevant documents in cell j in unit time is rtj, rij. kj called kj. It is not hard to see that, conditional on and will have a binomial distribution with parameters pj Using vector notation n_ is the with entries nj, hood function for similarly £ and d-dimensional column vector We can write down the likeli- p_t c_, k_ etc. c_ given k and n_ d k. L(-gj oj k, n) = •3 k J (n .-k J ((l-p.)c.) J ° J J (n .-k .) ! J 3 -c e j ! i = i 11. and the log-likelihood d ftfe, £» K n) = ) 3 = 1 - Cj ~ log k.\ - log (rij-kj) ! j [_klog V. + (n^.-kJlog(l-p_.) + n. log a. The logistic models are models of the form (log —— ) e i-p. where m is a vector subspace of K . For brevity we will just talk If we reparametrize by writing P. i • = log — — r J about the model for 1 < j < d 1-P. 3 then the log-likelihood becomes d l( l> £» h n) = ) 3 =1 \_kfj ~ njl°g(1+enJ) + nlog a.- Oj -> - log k.\ - log (n.-k.)\ J 2.2.2 Estimation If we consier the maximum likelihood estimates (m.l.e.) of i and c (assuming r T^ —m I e T p ) we see that the m.l.e. of c is n and does not depend on YY\. . The m.l.e. and is that c of T if it exists is denoted T iTL which maximises v —e d f(p) = f(kc.n^j = / = 7 tn9p) J = ^ { VV " n.log(l+evj) ^J) L &„*.. - njl°9(1+eP (1) (The existence of the m.l.e. is discussed below.) 12. We may easily Introduce Bayesian ideas into this model. general, we multiply the likelihood function by some factor which describes the prior distribution, thus obtaining the posterior distribution. Then instead of a maximum likelihood estimate, we f(g) In may consider a maximum posterior estimate (or posterior mode).. This will introduce an additive factor into the function which must be maximised - an example is given below. There are several possible ways of finding a maximum, i.e. of deriving the maximum likelihood or maximum posterior distribution estimates. It is necessary, however, to use an iterative procedure. 2.2.3 Existence of estimates Considering the maximum likelihood estimates, it is very convenient to define elements are the n.. such that N_ £ = 0. # to be the diagonal matrix where diagonal Now suppose there is a vector a e Then clearly f(p_+v) £ e Tfl s.t. = f(pj N_ £ = for a11 Hence £• . a necessary condition for the existence of an that there is no non-zero this condition on N and p e m.l.e. £. for j_ is r Imposing is strictly will ensure that f(p) concave and hence has at most one finite maximum and that any stationary value of to ensure / is a maximum but it is not sufficient has a maximum in It is unlikely that there is any easily verified condition on HL> K and W* which is equivalent to the existence of an m.l.e. (although some results can be proved). to use, the m.l.e. does not exist. Further, it is likely that The Bayesian prior, however, in many practical situations, for some models which one might want may be used as a mechanism to ensure the existence of an estimate (now a maximum posterior estimate). 2.2.4 Discussion The above descibes a general class of models for the probabilities p . The particular model is expressed as a linear . 13. subspace of the space generated by the logistic transforms I . T J An appropriate way to obtain estimates of the parameters of any model from data would be the maximum likelihood method. this method will not necessarily However, possess solutions at all, We may further introduce particularly where there is little data. prior information into the model by standard Bayesian methods, and maximise the posterior distribution rather than the likelihood. This procedure may be used to force the existence of a solution. Also, the more restricted the model (i.e. the smaller the dimension of Y H >, the more likely it is that the maximum likelihood method will give a solution. In section 2.3 this formalism is used to set up an "independence" model roughly equivalent to the Robertson-Sparck Jones model. We then discuss the possible theoretical advantages and disadvantages of the class of models defined here. Given the form of the class of models, the question naturally arises : under what conditions should one try to relax or restrict the model (i.e. increase or reduce its dimension)? In section 4 we return to this question in theoretical terms. 2.3 Example : the Independent Logistic Model We now consider a specific model which corresponds, in some sense, to the Robertson/Sparck Jones model. Our basic assumption is that probability of relevance will be modelled by a simple sum-of-weights, each weight being associated with one of the matching terms. that That is, we will assume 7 . T = I W, + W J where w to and ( W > are the quantities to be estimated, and the sum t whose presence defines the particular is over all query terms cell j of *d . This model can be simply expressed as a vector subspace , as required. 14. To formulate this model more precisely, we define set of query terms, and particular cell, Q as the T c Q as a set of terms defining a (In fact each "cell" may be subdivided i.e. those documents containing all the terms T Q*T. and none of the terms by the presence or absence of terms other than query terms; however, give this model, such subdivision with the W dummy term t makes no difference to the maximum likelihood estimates derived from (1).) In order to deal weight in the formula for T . we may define a T , is which occurs in all documents, and which Then automatically included in T. is defined as follows : *T= / wt (2) t eT S o , from ( 1 ) , f(w)~ \ L TsQ log (1 + exp( > w ^ > ~ Wt )) t e T t eT w, t eQ k^ T:teT term t. T:teT TcQ n^ log (1 + exp( t e T x-, is the total number of relevant documents assigned to The Ws which maximize f(W) give the maximum likelihood estimates of the IT'S. We may add a Bayesian prior : suppose for example that the W are assumed to be independent, normally distributed with means y, r 2 and common variance a (the means may, for example, be derived from the traditional collection frequency weights). Then the function to be maximised is : w, log(l+e teT teQ T:tzT TcQ ) - 15. 2.4 Comparison of present models with previous approaches We first describe how the Independent Logistic model might be applied in a practical situation, and then discuss its apparent advantages and disadvantages, in comparison with the Robertson/ Sparck Jones model. We also discuss its possible extension to deal with term dependence. It should be pointed out that the comparison between the present and previous approaches is not intended to provide a simple decision between the two. Rather we hope to shed some light on the whole problem, by looking at it from different points of view. 2.4.1 Possible application of the independent logistic model We assume a relevance feedback situation : that is, we assume an initial search which has revealed some relevant documents. want to use this information to estimate the w's (by maximising We expression (3) or (4)), so as to derive an improved ranking of the documents on the next run. The next run may be on the same collection (as in retrospective searching) or a new one (as in SDI). The result of the initial search is a small number (typically) of documents known to be relevant, also a small number (perhaps zero) of documents known to be non-relevant, and the large bulk of the collection of unknown relevance status. Perhaps the obvious way to use this data in a probablistic model would be to use only the documents of known relevance status in each cell in the derivation of estimates. However, it has been common in earlier systems to use the known relevant as such and all the remainder (known-nonrelevant and unknown) as "non-relevant". "complement" This is known as the The procedure (Harper and Van Rijsbergen, 1978). justification for this procedure is that almost all the unknown are certainly non-relevant; the resulting increase in precision of the estimates is likely to far outweigh the slight bias resulting. (Some experiments by Harper and Van Rijsbergen, using another formula, support this view). 16. We assume, then, that the complement procedure is adopted for the Independent logistic model; this has consequences which are discussed below. In that case, the quantities we need in order to maximise expression (3) or (4) are : (a) T:teT documents assigned to each term t. 2 > kT i.e. the total number of known relevant (b) rirpt i.e. the total number of documents, known or unknown, in each individual cell T. As indicated above, the estimates of the w's means of an iterative would be obtained by The particular maximisation algorithm. procedure used in our experiments is described in section 3.3. 2.4.2 Main points of comparison We may indicate the main differences betweeen the independent logistic model and the Robertson/Sparck Jones model : (a) The new model requires a litte more data, namely the individual ft™ values (as opposed to the total for each term) (b) The new model uses an iterative algorithm for obtaining The effect the estimates, rather than a simple formula. of this will depend very much on the number of query terms | Q I , since the procedure requires manipulation of | Q | x | Q \ matrix. a This it may affect computation time insignificantly for a 4-term query, but require a totally unrealistic amount of time for a 40-term SDI profile. The methods used and the resources required in the present experiments are discussed in section 3. 17. The independence assumptions of the new model are considerably weaker than those of the Robertson/Sparck Jones model. The definition of TfL given by expression (2) is equivalent to assuming, in some sense, that the degree of dependence of any set of terms is the same in the non-relevant set as it is in the relevant set (Robertson/ Sparck Jones assume no dependence in either set). One major advantage claimed for the logistic models in the medical context is that the estimation process does not involve assuming that the sample on which the estimates are based is random : it is only necessary to assume that the items in the sample from a given cell are a random sample of items in that cell. This property should confer a considerable advantage on the logistic models, since the estimation sample is drawn on the basis of cell membership. However, it appears that this advantage is nullified by the use of the Complement method for nonrelevant documents (for which there is no equivalent in the medical application). The major reason for the choice of the maximum likelihood method of estimating parameters is its mathematical tractability. Maximum likelihood estimates do have a number of desirable properties, but these properties tend to be asymptotic : that is they only hold absolutely for large samples. The estimation method used in the Robertson/ Sparck Jones model (i.e. the "#.5" formula, expression (1) in section 1.2) was chosen on the basis of a specific property of unbiasedness; this is also an asymptotic property, but the small-sample error is not too great, -2 -1 since it varies with n rather than n for sample size n. The Bayesian prior is a very much more flexible device than the equivalent in the Robertson/Sparck Jones method, 18. namely the weights. "+ 0.5" in the formula for calculating Different choices of the means vu reflect different assumptions about possible prior indications 2 of term value; different choices of the variance a reflect different assumptions about the reliance to be placed on these prior indications. Some experiments with different values are discussed in section 3. (g) Finally, the model itself is more flexible. term-pair Thus we can consider the addition of a new term, or of a (to allow for term-dependence not conforming . (Adding a term-pair to the independence assumptions) by simply adding an appropriate dimension to j7[ as a new component of m does not conflict with the independence assumptions discussed in (c) above, since A & B—>-4 applies to both the relevant and the nonrelevant set) . Thus we see that the logistic model, or class of models, differs from previous approaches in a number of interacting theoretical and practical ways. Section 3 is devoted to a series of experiments with various versions of the logistic model on two test collections.