4. Further formal analysis

We have already indicated that the logistic model allows for the introduction of new parameters, which may represent, for example, interactions between pairs of existing query terms or the addition of new terms to the query. A superficial analysis suggests that any additional parameters should be useful in helping to distinguish relevant from non-relevant documents, or at worst neutral. However, some of our early experiments were conducted on term interactions, with entirely negative results. A deeper analysis indicates, at least in qualitative terms, why this might be. There is only a limited amount of data from which the parameters are to be estimated, and introducing new parameters in effect stretches this data further. Thus all the parameters are less precisely estimated, and this drop in precision more than outweighs (or may do so) the potential improvement due to including the new parameters. This property was indicated previously by van Rijsbergen's (1979) phrase "the curse of dimensionality".

Is it possible to quantify this property, so as to derive a rule or rules which would define when a parameter should or should not be added? This question is addressed in the present section. We have not in fact succeeded in developing an immediately applicable rule; however, we feel that the analysis gives some insight into the problem, and may also provide a formal basis suitable for further work.

4.1 Definitions

Suppose that we take repeated samples of $n$ documents (so that after $t$ samples cell $j$ contains $t n_j$ documents) and that the probability of a document in cell $j$ being relevant is constant. If $k_j(t)$ is the number of relevant documents in cell $j$ after $t$ samples, then $k_j(t)/t \to n_j p_j$ as $t \to \infty$ a.s. With the obvious notation,

$$\frac{1}{t}\, f(k(t), tn; \pi) \;\to\; \sum_{j=1}^{d} \bigl[\, n_j p_j \pi_j - n_j \log(1 + e^{\pi_j}) \,\bigr] \quad \text{a.s.}$$

The value of $\pi$ which maximises the limit above will be denoted $\pi_m$. The difference $\pi - \pi_m$ (where $\pi$ here denotes the true log-odds vector, $\pi_j = \log(p_j/(1-p_j))$) can be thought of as a sort of asymptotic bias: it is where we lose out because our model is wrong. If $\pi \in \mathfrak{m}$ then $\pi_m = \pi$. As $\mathfrak{m}$ is made larger the bias will decrease, but the expected value of $\|\hat\pi_m - \pi_m\|$ (where $\hat\pi_m$ is the maximum likelihood estimate of $\pi$ within $\mathfrak{m}$) will increase. Suppose we add another vector to $\mathfrak{m}$, increasing its dimension by one. We would like to try and get estimates of the way the "bias" $\pi - \pi_m$ and the "variance" $E(\|\hat\pi_m - \pi_m\|^2)$ vary.

4.2 The "variance"

Let $\psi(x)$ be any real-valued function and suppose $\psi$ has a maximum at $x = 0$, say $\psi(x) = a_0 + a_2 x^2 + a_3 x^3 + \dots$. Expanding $\psi'$ and $\psi''$ as power series we have $\psi'(x) = 2a_2 x + 3a_3 x^2 + \dots$ and $\psi''(x) = 2a_2 + 6a_3 x + \dots$, and so

$$\frac{\psi'(x)}{\psi''(x)} = x + O(x^2) \quad \text{as } x \to 0.$$

Now suppose $\psi$ has its maximum at $\bar x$. Then if $x - \bar x$ is small, $\psi'(x)/\psi''(x)$ is a reasonable estimate of $x - \bar x$. When $x$ is a vector variable $\psi'(x)$ becomes the vector of first-order partial derivatives and $\psi''(x)$ the matrix of second-order derivatives, but the argument still works. Hence we might be able to get some idea of the behaviour of $\hat\pi_m$ by looking at the first and second derivatives of $f$:

$$\frac{\partial f}{\partial \pi_j} = k_j - n_j\,\frac{e^{\pi_j}}{1 + e^{\pi_j}} = k_j - n_j r_j,$$

say, with the obvious notation. Also

$$\frac{\partial^2 f}{\partial \pi_i \partial \pi_j} = 0 \;(i \neq j), \qquad \frac{\partial^2 f}{\partial \pi_j^2} = -n_j r_j (1 - r_j),$$

which we will abbreviate as

$$\frac{\partial^2 f}{\partial \pi^2} = -N \Phi(\pi),$$

where $\Phi(\pi)$ is the diagonal matrix whose $i$'th diagonal entry is $r_i(1 - r_i)$. Now let $A$ be a $d \times m$ matrix whose columns are a basis for $\mathfrak{m}$ ($m$ being the dimension of $\mathfrak{m}$). Writing $\pi = A\alpha$ we have (with some abuse of notation)

$$\frac{\partial}{\partial \alpha} f(A\alpha) = A^T \frac{\partial f}{\partial \pi} = A^T(k - Nr),$$

$$\frac{\partial^2}{\partial \alpha^2} f(A\alpha) = A^T \frac{\partial^2 f}{\partial \pi^2}\, A = -A^T N \Phi(\pi) A.$$
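The Newton-Raphson fit implied by these derivatives is easy to sketch numerically. The following is a minimal illustration, not part of the original analysis: the function name, variable names and toy figures are ours, but the gradient $A^T(k - Nr)$ and Hessian $-A^T N \Phi A$ are exactly those given above.

```python
# A minimal sketch (not from the report; names and figures are ours) of fitting the
# cell model by Newton-Raphson, using the derivatives given above:
#   gradient   d f(A alpha)/d alpha     =  A^T (k - N r)
#   Hessian    d^2 f(A alpha)/d alpha^2 = -A^T N Phi A
import numpy as np

def fit_constrained_logit(k, n, A, iters=50, tol=1e-10):
    """Maximise f(A alpha) = sum_j [k_j (A alpha)_j - n_j log(1 + exp((A alpha)_j))]."""
    alpha = np.zeros(A.shape[1])
    for _ in range(iters):
        pi = A @ alpha                          # log-odds in each cell
        r = 1.0 / (1.0 + np.exp(-pi))           # fitted probability of relevance per cell
        grad = A.T @ (k - n * r)                # A^T (k - N r)
        if np.max(np.abs(grad)) < tol:
            break
        hess = -(A.T * (n * r * (1.0 - r))) @ A   # -A^T N Phi A  (Phi diagonal)
        alpha -= np.linalg.solve(hess, grad)      # Newton step
    return alpha

# toy data: d = 4 cells, basis of constant vector plus two term-presence vectors (m = 3)
n = np.array([50.0, 30.0, 20.0, 10.0])   # documents per cell
k = np.array([5.0, 9.0, 7.0, 6.0])       # relevant documents per cell
A = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)

alpha_hat = fit_constrained_logit(k, n, A)
print("pi_hat =", A @ alpha_hat)
print("fitted probabilities =", 1.0 / (1.0 + np.exp(-(A @ alpha_hat))))
```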
As both $\hat\pi_m$ and $\pi_m$ are in $\mathfrak{m}$ we can find $\hat\alpha$ and $\alpha_m$ such that $\hat\pi_m = A\hat\alpha$ and $\pi_m = A\alpha_m$; $\hat\alpha$ is the value of $\alpha$ which maximises $f(A\alpha)$. Hence we can approximate

$$\hat\pi_m - \pi_m \;\approx\; A\,(A^T N \Phi(\bar p_m) A)^{-1} A^T (k - N \bar p_m),$$

where $\bar p_m$ is the vector of fitted probabilities corresponding to $\pi_m$, $(\bar p_m)_j = e^{(\pi_m)_j}/(1 + e^{(\pi_m)_j})$, and $\Phi(\bar p_m)$ is the diagonal matrix with entries $(\bar p_m)_j(1 - (\bar p_m)_j)$.

First we note that $A^T N \Phi A$ will be non-singular if and only if $NA$ has rank $m$, but this is implied by our assumptions, since $Ny = 0$ and $y \in \mathfrak{m}$ imply $y = 0$.

Now, if instead of the usual Euclidean distance we use the norm

$$\|y\|^2 = y^T N \Phi(\bar p_m)\, y,$$

we find that

$$\|\hat\pi_m - \pi_m\|^2 \;\approx\; (k - N\bar p_m)^T A\,(A^T N \Phi(\bar p_m) A)^{-1} A^T (k - N\bar p_m). \qquad (1)$$

Because of the way $\bar p_m$ is defined we have that $A^T(Np - N\bar p_m) = 0$, and so we can replace our estimate (1) by

$$(k - Np)^T A\,(A^T N \Phi(\bar p_m) A)^{-1} A^T (k - Np). \qquad (2)$$

Now for each $j$, $k_j - n_j p_j$ is a random variable with mean 0. If for $n_j \neq 0$ we define

$$X_j = (k_j - n_j p_j)\,/\,(n_j p_j (1 - p_j))^{1/2},$$

which is approximately $N(0,1)$, and for $n_j = 0$ we let $X_j$ be an $N(0,1)$ random variable independent of the other $X_i$, then for all $j$ we have $X_j (n_j p_j (1 - p_j))^{1/2} = k_j - n_j p_j$ and we can replace our estimate by

$$X^T (N\Phi(p))^{1/2} A\,(A^T N \Phi(\bar p_m) A)^{-1} A^T (N\Phi(p))^{1/2} X. \qquad (3)$$

Now $E(X_i X_j)$ is 0 for $i \neq j$ and 1 for $i = j$, and so the expectation of (3) given $N$ will be the trace of the inner matrix. Reorganising it a bit we get

$$E\bigl((\hat\pi_m - \pi_m)^T N \Phi(\bar p_m)(\hat\pi_m - \pi_m) \mid N\bigr) \;\approx\; \mathrm{Tr}\bigl((A^T N \Phi(\bar p_m) A)^{-1}(A^T N \Phi(p) A)\bigr). \qquad (4)$$

The first point about this is that if $\pi \in \mathfrak{m}$ then the above estimate equals $m$ (the dimension of $\mathfrak{m}$), and each new vector we add increases it by 1. We will come back to the more general case later.

There are two main questions about the above estimate: "How good is it?" and "Can we do better?". It is possible that it is not too bad; the difference between $f'/f''$ and $\hat\pi_m - \pi_m$ might average out when we take the expectation. Clearly more work could be done.

4.3 The "bias"

Let $\mathfrak{m}$ and $A$ be defined as before, and suppose we increase $\mathfrak{m}$ to include the vector $a$. We write $\mathfrak{n} = \mathfrak{m} \oplus \langle a \rangle$ and $B = (A\ a)$. We assume that both maxima exist, i.e. that

$$f(\pi) = \sum_j n_j \bigl[\, p_j \pi_j - \log(1 + e^{\pi_j}) \,\bigr]$$

has a finite maximum in $\mathfrak{m}$ at $\pi_m$ and in $\mathfrak{n}$ at $\pi_n$. $\pi_m$ will be the unique vector satisfying $\pi_m \in \mathfrak{m}$ and $A^T (\partial f/\partial \pi)(\pi_m) = 0$; but $\partial f/\partial \pi = N(p - r)$, so $\pi_m$ is the unique vector satisfying

(i) $\pi_m \in \mathfrak{m}$
(ii) $A^T N(p - \bar p_m) = 0$,

and similarly for $\pi_n$ we have

(iii) $\pi_n \in \mathfrak{n}$
(iv) $B^T N(p - \bar p_n) = 0$.

Note that (i) $\Rightarrow$ (iii) and (iv) $\Rightarrow$ (ii).

Now let us suppose we have an arc $\gamma(t)$ in $\mathfrak{n}$, $0 \le t \le 1$ (as usual $q_j(t) = e^{\gamma_j(t)}/(1 + e^{\gamma_j(t)})$), with the following properties:

(1) $\gamma(0) = \pi_m$ and $\gamma(1) = \pi_n$;
(2) $A^T N(p - q(t)) = 0$, $0 \le t \le 1$;
(3) $\gamma(t) \in \mathfrak{n}$, $0 \le t \le 1$;
(4) $a^T N(p - q(t))$ is monotonically decreasing for $0 \le t \le 1$.

By (3) we can write $\gamma(t) = B\zeta(t)$ with $\zeta(t) \in \mathbb{R}^H$, where $H = \dim \mathfrak{n} = m + 1$, and by (2)

$$B^T N(p - q(t)) = \bigl(0, \dots, 0,\; a^T N(p - q(t))\bigr)^T.$$

Differentiating both sides with respect to $t$ gives

$$\frac{d}{dt}\, B^T N(p - q(t)) = -B^T N\, \frac{\exp B\zeta(t)}{(1 + \exp B\zeta(t))^2}\, B\zeta'(t) = -B^T N \Phi(q(t))\, B\zeta'(t),$$

but $\gamma'(t) = B\zeta'(t)$, so

$$\zeta'(t) = -(B^T N \Phi(q(t)) B)^{-1} \Bigl(0, \dots, 0,\; \tfrac{d}{dt}\, a^T N(p - q(t))\Bigr)^T$$

and hence, approximately,

$$\gamma'(t) \;\approx\; -\,(a^T N \Phi(q(t))\, a)^{-1}\, a\; \frac{d}{dt}\, a^T N(p - q(t)),$$

giving

$$\pi_n - \pi_m \;\approx\; \frac{a^T N(p - \bar p_m)}{a^T N \Phi(\bar p_m)\, a}\; a.$$

Hence the decrease in the "bias", $(\pi_n - \pi_m)^T N \Phi(\bar p_m)(\pi_n - \pi_m)$, is about

$$\frac{\bigl(a^T N(p - \bar p_m)\bigr)^2}{a^T N \Phi(\bar p_m)\, a}.$$

If we go back to formula (4) in the "variance" section and impose the conditions $a^T N \Phi A = 0$ and $\Phi(\bar p_n) \approx \Phi(\bar p_m)$, we find that the change in the variance going from $\mathfrak{m}$ to $\mathfrak{n}$ is about

$$\frac{a^T N \Phi(p)\, a}{a^T N \Phi(\bar p_m)\, a}.$$

Comparing these we get the criterion that we should include $a$ if

$$\bigl(a^T N(p - \bar p_m)\bigr)^2 > a^T N \Phi(p)\, a.$$
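The quantities entering this criterion are easy to evaluate numerically once $\pi_m$ is known. The sketch below is our own illustration, not part of the original analysis, and uses hypothetical figures: it computes $\bar p_m$ by maximising the limiting objective over $\mathfrak{m}$, evaluates formula (4), and then evaluates the approximate decrease in bias and increase in variance for a candidate extra vector $a$ as given by the expressions above.

```python
# A numerical sketch (not from the report; figures are hypothetical) of the quantities
# above: pi_m maximises the limiting objective over the subspace spanned by A's columns,
# formula (4) gives the "variance", and the last lines evaluate the approximate decrease
# in bias and increase in variance for a candidate extra vector a.
import numpy as np

def asymptotic_fit(p, n, A, iters=100):
    """Fitted probabilities for pi_m = argmax over m of sum_j n_j [p_j pi_j - log(1+e^pi_j)]."""
    alpha = np.zeros(A.shape[1])
    for _ in range(iters):
        r = 1.0 / (1.0 + np.exp(-(A @ alpha)))
        grad = A.T @ (n * (p - r))                    # A^T N (p - r)
        hess = -(A.T * (n * r * (1.0 - r))) @ A       # -A^T N Phi A
        alpha -= np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-(A @ alpha)))

n = np.array([50.0, 30.0, 20.0, 10.0])        # documents per cell
p = np.array([0.10, 0.30, 0.35, 0.60])        # true (hypothetical) relevance probabilities
A = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]], dtype=float)
a = np.array([1.0, -1.0, -1.0, 1.0])          # candidate extra vector

p_m = asymptotic_fit(p, n, A)
phi_m, phi_p = p_m * (1.0 - p_m), p * (1.0 - p)

# formula (4): equals m exactly when the true log-odds lie in the subspace (p_m = p)
variance = np.trace(np.linalg.solve((A.T * (n * phi_m)) @ A, (A.T * (n * phi_p)) @ A))

bias_drop = (a @ (n * (p - p_m))) ** 2 / (a @ (n * phi_m * a))   # decrease in "bias"
var_rise = (a @ (n * phi_p * a)) / (a @ (n * phi_m * a))         # increase in "variance"

print("variance estimate (4):", variance)
print("bias decrease:", bias_drop, " variance increase:", var_rise)
print("include a?", bias_drop > var_rise)
```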
4.4 An example

It is interesting to look at the case when we have only two terms. Then $d = 4$, the four classes being:

class 0 : all the documents with neither term
class 1 : the documents with term 1 but not term 2
class 2 : the documents with term 2 but not term 1
class 3 : the documents with both terms

We want to see what happens when we add the interaction to the independence model. The independence model has dimension 3. A suitable basis for $\mathfrak{m}$ would be the columns of $A$, where

$$A = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}$$

(the rows corresponding to classes 0 to 3). If we assume that there are some documents in each class (so that $N$ is non-singular), then a suitable candidate for $a$ is

$$a = (1, -1, -1, 1)^T.$$

The criterion for the inclusion of $a$ then becomes

$$\bigl(a^T N(p - \bar p_m)\bigr)^2 \bigl(a^T N \Phi(\bar p_m)\, a\bigr)^{-1} > \bigl(a^T N \Phi(p)\, a\bigr)\bigl(a^T N \Phi(\bar p_m)\, a\bigr)^{-1}.$$

If we assume $\bar p_m - p$ is small compared to $p$ we can simplify this to

$$\bigl(a^T (p - \bar p_m)\bigr)^2 > \sum_j \bigl(p_j (1 - p_j)\, n_j\bigr)^{-1}.$$

The way the left-hand side varies with $N$ needs more investigation, but it should be sandwiched between positive upper and lower bounds. The right-hand side, on the other hand, tends to 0 only if all the $n_j$ tend to infinity. The tentative conclusion we might draw from this is that there is no point in including an interaction unless there are a number of documents in each of the four classes.
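A short sketch of this example follows. It is our own illustration, not part of the original analysis: the figures are hypothetical, and the inequality tested is the simplified criterion in the form written above. The second allocation leaves class 3 (documents with both terms) almost empty, which inflates the right-hand side.

```python
# A small sketch (not from the report; figures are hypothetical) of the two-term example:
# the independence basis A, the interaction vector a, and the two sides of the simplified
# criterion as written above, for two allocations of documents to the four classes.
import numpy as np

def independence_fit(p, n, A, iters=100):
    """Fitted probabilities under the dimension-3 independence model."""
    alpha = np.zeros(A.shape[1])
    for _ in range(iters):
        r = 1.0 / (1.0 + np.exp(-(A @ alpha)))
        grad = A.T @ (n * (p - r))
        hess = -(A.T * (n * r * (1.0 - r))) @ A
        alpha -= np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-(A @ alpha)))

A = np.array([[1, 0, 0],      # class 0: neither term
              [1, 1, 0],      # class 1: term 1 only
              [1, 0, 1],      # class 2: term 2 only
              [1, 1, 1]],     # class 3: both terms
             dtype=float)
a = np.array([1.0, -1.0, -1.0, 1.0])        # interaction vector
p = np.array([0.05, 0.20, 0.25, 0.70])      # true relevance probabilities (hypothetical)

for n in (np.array([400.0, 300.0, 200.0, 100.0]),   # all four classes well populated
          np.array([400.0, 300.0, 200.0, 2.0])):    # hardly any documents with both terms
    p_m = independence_fit(p, n, A)
    lhs = (a @ (p - p_m)) ** 2                   # stays bounded however N varies
    rhs = np.sum(1.0 / (n * p * (1.0 - p)))      # small only when every n_j is large
    print("n =", n, " LHS =", round(lhs, 4), " RHS =", round(rhs, 4),
          " include interaction?", lhs > rhs)
```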