4. Further formal analysis

We have already indicated that the logistic model allows for the introduction of new parameters, which may represent, for example, interactions between pairs of existing query terms or the addition of new terms to the query. A superficial analysis suggests that any additional parameters should be useful in helping to distinguish relevant from non-relevant documents, or at worst neutral. However, some of our early experiments were conducted on term interactions, with entirely negative results. A deeper analysis indicates, at least in qualitative terms, why this might be. There is only a limited amount of data from which the parameters are to be estimated, and introducing new parameters in effect stretches this data further. Thus all the parameters are less precisely estimated, and this drop in precision more than outweighs (or may do so) the potential improvement due to including the new parameters. This property was indicated previously by van Rijsbergen's (1979) phrase "the curse of dimensionality".

Is it possible to quantify this property, so as to derive a rule or rules which would define when a parameter should or should not be added? This question is addressed in the present section. We have not in fact succeeded in developing an immediately applicable rule; however, we feel that the analysis gives some insight into the problem, and may also provide a formal basis suitable for further work.

4.1 Definitions

Suppose that we take repeated samples of $n$ documents (so that after $t$ samples cell $j$ contains $t n_j$ documents) and that the probability of a document in cell $j$ being relevant is constant. If $k_j(t)$ is the number of relevant documents in cell $j$ after $t$ samples, then $k_j(t)/t \to n_j p_j$ as $t \to \infty$ a.s. With the obvious notation,

$$\frac{1}{t}\, f(k(t), tn; \pi) \;\to\; \sum_{j=1}^{d} \bigl[\, n_j p_j \pi_j - n_j \log(1 + e^{\pi_j}) \,\bigr] \quad \text{a.s.}$$

The value of $\pi$ which maximises the limit above will be denoted $\pi_m$. The difference $\pi - \pi_m$ (where $\pi$ here denotes the true log-odds vector, $\pi_j = \log(p_j/(1-p_j))$) can be thought of as a sort of asymptotic bias: it is where we lose out because our model is wrong. If $\pi \in \mathfrak{m}$ then $\pi_m = \pi$. As $\mathfrak{m}$ is made larger the bias will decrease, but the expected value of $\|\hat\pi_m - \pi_m\|$ (where $\hat\pi_m$ is the maximum likelihood estimate of $\pi$ within $\mathfrak{m}$) will increase. Suppose we add another vector to $\mathfrak{m}$, increasing its dimension by one. We would like to try and get estimates of the way the "bias" $\pi - \pi_m$ and the "variance" $E(\|\hat\pi_m - \pi_m\|^2)$ vary.

4.2 The "variance"

Let $\psi(x)$ be any real-valued function and suppose $\psi$ has a maximum at $x = 0$, say $\psi(x) = a_0 + a_2 x^2 + a_3 x^3 + \dots$. Expanding $\psi'$ and $\psi''$ as power series we have $\psi'(x) = 2a_2 x + 3a_3 x^2 + \dots$ and $\psi''(x) = 2a_2 + 6a_3 x + \dots$, and so

$$\frac{\psi'(x)}{\psi''(x)} = x + O(x^2) \quad \text{as } x \to 0.$$

Now suppose $\psi$ has its maximum at $\bar x$. Then if $x - \bar x$ is small, $\psi'(x)/\psi''(x)$ is a reasonable estimate of $x - \bar x$. When $x$ is a vector variable $\psi'(x)$ becomes the vector of first-order partial derivatives and $\psi''(x)$ the matrix of second-order derivatives, but the argument still works. Hence we might be able to get some idea of the behaviour of $\hat\pi_m$ by looking at the first and second derivatives of $f$:

$$\frac{\partial f}{\partial \pi_j} = k_j - n_j\,\frac{e^{\pi_j}}{1 + e^{\pi_j}} = k_j - n_j r_j,$$

say, with the obvious notation. Also

$$\frac{\partial^2 f}{\partial \pi_i \partial \pi_j} = 0 \;(i \neq j), \qquad \frac{\partial^2 f}{\partial \pi_j^2} = -n_j r_j (1 - r_j),$$

which we will abbreviate as

$$\frac{\partial^2 f}{\partial \pi^2} = -N \Phi(\pi),$$

where $\Phi(\pi)$ is the diagonal matrix whose $i$'th diagonal entry is $r_i(1 - r_i)$. Now let $A$ be a $d \times m$ matrix whose columns are a basis for $\mathfrak{m}$ ($m$ being the dimension of $\mathfrak{m}$). Writing $\pi = A\alpha$ we have (with some abuse of notation)

$$\frac{\partial}{\partial \alpha} f(A\alpha) = A^T \frac{\partial f}{\partial \pi} = A^T(k - Nr),$$

$$\frac{\partial^2}{\partial \alpha^2} f(A\alpha) = A^T \frac{\partial^2 f}{\partial \pi^2}\, A = -A^T N \Phi(\pi) A.$$
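The Newton-Raphson fit implied by these derivatives is easy to sketch numerically. The following is a minimal illustration, not part of the original analysis: the function name, variable names and toy figures are ours, but the gradient $A^T(k - Nr)$ and Hessian $-A^T N \Phi A$ are exactly those given above.

```python
# A minimal sketch (not from the report; names and figures are ours) of fitting the
# cell model by Newton-Raphson, using the derivatives given above:
#   gradient   d f(A alpha)/d alpha     =  A^T (k - N r)
#   Hessian    d^2 f(A alpha)/d alpha^2 = -A^T N Phi A
import numpy as np

def fit_constrained_logit(k, n, A, iters=50, tol=1e-10):
    """Maximise f(A alpha) = sum_j [k_j (A alpha)_j - n_j log(1 + exp((A alpha)_j))]."""
    alpha = np.zeros(A.shape[1])
    for _ in range(iters):
        pi = A @ alpha                          # log-odds in each cell
        r = 1.0 / (1.0 + np.exp(-pi))           # fitted probability of relevance per cell
        grad = A.T @ (k - n * r)                # A^T (k - N r)
        if np.max(np.abs(grad)) < tol:
            break
        hess = -(A.T * (n * r * (1.0 - r))) @ A   # -A^T N Phi A  (Phi diagonal)
        alpha -= np.linalg.solve(hess, grad)      # Newton step
    return alpha

# toy data: d = 4 cells, basis of constant vector plus two term-presence vectors (m = 3)
n = np.array([50.0, 30.0, 20.0, 10.0])   # documents per cell
k = np.array([5.0, 9.0, 7.0, 6.0])       # relevant documents per cell
A = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)

alpha_hat = fit_constrained_logit(k, n, A)
print("pi_hat =", A @ alpha_hat)
print("fitted probabilities =", 1.0 / (1.0 + np.exp(-(A @ alpha_hat))))
```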
As both $\hat\pi_m$ and $\pi_m$ are in $\mathfrak{m}$ we can find $\hat\alpha$ and $\alpha_m$ such that $\hat\pi_m = A\hat\alpha$ and $\pi_m = A\alpha_m$; $\hat\alpha$ is the value of $\alpha$ which maximises $f(A\alpha)$. Hence we can approximate

$$\hat\pi_m - \pi_m \;\approx\; A\,(A^T N \Phi(\bar p_m) A)^{-1} A^T (k - N \bar p_m),$$

where $\bar p_m$ is the vector of fitted probabilities corresponding to $\pi_m$, $(\bar p_m)_j = e^{(\pi_m)_j}/(1 + e^{(\pi_m)_j})$, and $\Phi(\bar p_m)$ is the diagonal matrix with entries $(\bar p_m)_j(1 - (\bar p_m)_j)$.

First we note that $A^T N \Phi A$ will be non-singular if and only if $NA$ has rank $m$, but this is implied by our assumptions, since $Ny = 0$ and $y \in \mathfrak{m}$ imply $y = 0$.

Now, if instead of the usual Euclidean distance we use the norm

$$\|y\|^2 = y^T N \Phi(\bar p_m)\, y,$$

we find that

$$\|\hat\pi_m - \pi_m\|^2 \;\approx\; (k - N\bar p_m)^T A\,(A^T N \Phi(\bar p_m) A)^{-1} A^T (k - N\bar p_m). \qquad (1)$$

Because of the way $\bar p_m$ is defined we have that $A^T(Np - N\bar p_m) = 0$, and so we can replace our estimate (1) by

$$(k - Np)^T A\,(A^T N \Phi(\bar p_m) A)^{-1} A^T (k - Np). \qquad (2)$$

Now for each $j$, $k_j - n_j p_j$ is a random variable with mean 0. If for $n_j \neq 0$ we define

$$X_j = (k_j - n_j p_j)\,/\,(n_j p_j (1 - p_j))^{1/2},$$

which is approximately $N(0,1)$, and for $n_j = 0$ we let $X_j$ be an $N(0,1)$ random variable independent of the other $X_i$, then for all $j$ we have $X_j (n_j p_j (1 - p_j))^{1/2} = k_j - n_j p_j$ and we can replace our estimate by

$$X^T (N\Phi(p))^{1/2} A\,(A^T N \Phi(\bar p_m) A)^{-1} A^T (N\Phi(p))^{1/2} X. \qquad (3)$$

Now $E(X_i X_j)$ is 0 for $i \neq j$ and 1 for $i = j$, and so the expectation of (3) given $N$ will be the trace of the inner matrix. Reorganising it a bit we get

$$E\bigl((\hat\pi_m - \pi_m)^T N \Phi(\bar p_m)(\hat\pi_m - \pi_m) \mid N\bigr) \;\approx\; \mathrm{Tr}\bigl((A^T N \Phi(\bar p_m) A)^{-1}(A^T N \Phi(p) A)\bigr). \qquad (4)$$

The first point about this is that if $\pi \in \mathfrak{m}$ then the above estimate equals $m$ (the dimension of $\mathfrak{m}$), and each new vector we add increases it by 1. We will come back to the more general case later.

There are two main questions about the above estimate: "How good is it?" and "Can we do better?". It is possible that it is not too bad; the difference between $f'/f''$ and $\hat\pi_m - \pi_m$ might average out when we take the expectation. Clearly more work could be done.

4.3 The "bias"

Let $\mathfrak{m}$ and $A$ be defined as before, and suppose we increase $\mathfrak{m}$ to include the vector $a$. We write $\mathfrak{n} = \mathfrak{m} \oplus \langle a \rangle$ and $B = (A\ a)$. We assume that both maxima exist, i.e. that

$$f(\pi) = \sum_j n_j \bigl[\, p_j \pi_j - \log(1 + e^{\pi_j}) \,\bigr]$$

has a finite maximum in $\mathfrak{m}$ at $\pi_m$ and in $\mathfrak{n}$ at $\pi_n$. $\pi_m$ will be the unique vector satisfying $\pi_m \in \mathfrak{m}$ and $A^T (\partial f/\partial \pi)(\pi_m) = 0$; but $\partial f/\partial \pi = N(p - r)$, so $\pi_m$ is the unique vector satisfying

(i) $\pi_m \in \mathfrak{m}$
(ii) $A^T N(p - \bar p_m) = 0$,

and similarly for $\pi_n$ we have

(iii) $\pi_n \in \mathfrak{n}$
(iv) $B^T N(p - \bar p_n) = 0$.

Note that (i) $\Rightarrow$ (iii) and (iv) $\Rightarrow$ (ii).

Now let us suppose we have an arc $\gamma(t)$ in $\mathfrak{n}$, $0 \le t \le 1$ (as usual $q_j(t) = e^{\gamma_j(t)}/(1 + e^{\gamma_j(t)})$), with the following properties:

(1) $\gamma(0) = \pi_m$ and $\gamma(1) = \pi_n$;
(2) $A^T N(p - q(t)) = 0$, $0 \le t \le 1$;
(3) $\gamma(t) \in \mathfrak{n}$, $0 \le t \le 1$;
(4) $a^T N(p - q(t))$ is monotonically decreasing for $0 \le t \le 1$.

By (3) we can write $\gamma(t) = B\zeta(t)$ with $\zeta(t) \in \mathbb{R}^H$, where $H = \dim \mathfrak{n} = m + 1$, and by (2)

$$B^T N(p - q(t)) = \bigl(0, \dots, 0,\; a^T N(p - q(t))\bigr)^T.$$

Differentiating both sides with respect to $t$ gives

$$\frac{d}{dt}\, B^T N(p - q(t)) = -B^T N\, \frac{\exp B\zeta(t)}{(1 + \exp B\zeta(t))^2}\, B\zeta'(t) = -B^T N \Phi(q(t))\, B\zeta'(t),$$

but $\gamma'(t) = B\zeta'(t)$, so

$$\zeta'(t) = -(B^T N \Phi(q(t)) B)^{-1} \Bigl(0, \dots, 0,\; \tfrac{d}{dt}\, a^T N(p - q(t))\Bigr)^T$$

and hence, approximately,

$$\gamma'(t) \;\approx\; -\,(a^T N \Phi(q(t))\, a)^{-1}\, a\; \frac{d}{dt}\, a^T N(p - q(t)),$$

giving

$$\pi_n - \pi_m \;\approx\; \frac{a^T N(p - \bar p_m)}{a^T N \Phi(\bar p_m)\, a}\; a.$$

Hence the decrease in the "bias", $(\pi_n - \pi_m)^T N \Phi(\bar p_m)(\pi_n - \pi_m)$, is about

$$\frac{\bigl(a^T N(p - \bar p_m)\bigr)^2}{a^T N \Phi(\bar p_m)\, a}.$$

If we go back to formula (4) in the "variance" section and impose the conditions $a^T N \Phi A = 0$ and $\Phi(\bar p_n) \approx \Phi(\bar p_m)$, we find that the change in the variance going from $\mathfrak{m}$ to $\mathfrak{n}$ is about

$$\frac{a^T N \Phi(p)\, a}{a^T N \Phi(\bar p_m)\, a}.$$

Comparing these we get the criterion that we should include $a$ if

$$\bigl(a^T N(p - \bar p_m)\bigr)^2 > a^T N \Phi(p)\, a.$$
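The quantities entering this criterion are easy to evaluate numerically once $\pi_m$ is known. The sketch below is our own illustration, not part of the original analysis, and uses hypothetical figures: it computes $\bar p_m$ by maximising the limiting objective over $\mathfrak{m}$, evaluates formula (4), and then evaluates the approximate decrease in bias and increase in variance for a candidate extra vector $a$ as given by the expressions above.

```python
# A numerical sketch (not from the report; figures are hypothetical) of the quantities
# above: pi_m maximises the limiting objective over the subspace spanned by A's columns,
# formula (4) gives the "variance", and the last lines evaluate the approximate decrease
# in bias and increase in variance for a candidate extra vector a.
import numpy as np

def asymptotic_fit(p, n, A, iters=100):
    """Fitted probabilities for pi_m = argmax over m of sum_j n_j [p_j pi_j - log(1+e^pi_j)]."""
    alpha = np.zeros(A.shape[1])
    for _ in range(iters):
        r = 1.0 / (1.0 + np.exp(-(A @ alpha)))
        grad = A.T @ (n * (p - r))                    # A^T N (p - r)
        hess = -(A.T * (n * r * (1.0 - r))) @ A       # -A^T N Phi A
        alpha -= np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-(A @ alpha)))

n = np.array([50.0, 30.0, 20.0, 10.0])        # documents per cell
p = np.array([0.10, 0.30, 0.35, 0.60])        # true (hypothetical) relevance probabilities
A = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]], dtype=float)
a = np.array([1.0, -1.0, -1.0, 1.0])          # candidate extra vector

p_m = asymptotic_fit(p, n, A)
phi_m, phi_p = p_m * (1.0 - p_m), p * (1.0 - p)

# formula (4): equals m exactly when the true log-odds lie in the subspace (p_m = p)
variance = np.trace(np.linalg.solve((A.T * (n * phi_m)) @ A, (A.T * (n * phi_p)) @ A))

bias_drop = (a @ (n * (p - p_m))) ** 2 / (a @ (n * phi_m * a))   # decrease in "bias"
var_rise = (a @ (n * phi_p * a)) / (a @ (n * phi_m * a))         # increase in "variance"

print("variance estimate (4):", variance)
print("bias decrease:", bias_drop, " variance increase:", var_rise)
print("include a?", bias_drop > var_rise)
```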
4.4 An example

It is interesting to look at the case when we have only two terms. Then $d = 4$, the four classes being:

class 0 : all the documents with neither term
class 1 : the documents with term 1 but not term 2
class 2 : the documents with term 2 but not term 1
class 3 : the documents with both terms

We want to see what happens when we add the interaction to the independence model. The independence model has dimension 3. A suitable basis for $\mathfrak{m}$ would be the columns of $A$, where

$$A = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}$$

(the rows corresponding to classes 0 to 3). If we assume that there are some documents in each class (so that $N$ is non-singular), then a suitable candidate for $a$ is

$$a = (1, -1, -1, 1)^T.$$

The criterion for the inclusion of $a$ then becomes

$$\bigl(a^T N(p - \bar p_m)\bigr)^2 \bigl(a^T N \Phi(\bar p_m)\, a\bigr)^{-1} > \bigl(a^T N \Phi(p)\, a\bigr)\bigl(a^T N \Phi(\bar p_m)\, a\bigr)^{-1}.$$

If we assume $\bar p_m - p$ is small compared to $p$ we can simplify this to

$$\bigl(a^T (p - \bar p_m)\bigr)^2 > \sum_j \bigl(p_j (1 - p_j)\, n_j\bigr)^{-1}.$$

The way the left-hand side varies with $N$ needs more investigation, but it should be sandwiched between positive upper and lower bounds. The right-hand side, on the other hand, tends to 0 only if all the $n_j$ tend to infinity. The tentative conclusion we might draw from this is that there is no point in including an interaction unless there are a number of documents in each of the four classes.
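A short sketch of this example follows. It is our own illustration, not part of the original analysis: the figures are hypothetical, and the inequality tested is the simplified criterion in the form written above. The second allocation leaves class 3 (documents with both terms) almost empty, which inflates the right-hand side.

```python
# A small sketch (not from the report; figures are hypothetical) of the two-term example:
# the independence basis A, the interaction vector a, and the two sides of the simplified
# criterion as written above, for two allocations of documents to the four classes.
import numpy as np

def independence_fit(p, n, A, iters=100):
    """Fitted probabilities under the dimension-3 independence model."""
    alpha = np.zeros(A.shape[1])
    for _ in range(iters):
        r = 1.0 / (1.0 + np.exp(-(A @ alpha)))
        grad = A.T @ (n * (p - r))
        hess = -(A.T * (n * r * (1.0 - r))) @ A
        alpha -= np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-(A @ alpha)))

A = np.array([[1, 0, 0],      # class 0: neither term
              [1, 1, 0],      # class 1: term 1 only
              [1, 0, 1],      # class 2: term 2 only
              [1, 1, 1]],     # class 3: both terms
             dtype=float)
a = np.array([1.0, -1.0, -1.0, 1.0])        # interaction vector
p = np.array([0.05, 0.20, 0.25, 0.70])      # true relevance probabilities (hypothetical)

for n in (np.array([400.0, 300.0, 200.0, 100.0]),   # all four classes well populated
          np.array([400.0, 300.0, 200.0, 2.0])):    # hardly any documents with both terms
    p_m = independence_fit(p, n, A)
    lhs = (a @ (p - p_m)) ** 2                   # stays bounded however N varies
    rhs = np.sum(1.0 / (n * p * (1.0 - p)))      # small only when every n_j is large
    print("n =", n, " LHS =", round(lhs, 4), " RHS =", round(rhs, 4),
          " include interaction?", lhs > rhs)
```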