XV-1

Query Clustering in a Large Document Space S. Worona

Abstract The Cranfield 424 document collection is clustered using queries and known relevance judgments. This clustering method is compared to a

full search of che collection, and several searches using a standard clustering technique. to the experiment. Several new evaluation parameters are defined and applied

1.

Introduction One of the most important aspects of any information retrieval sys-

tem is time — h o w quickly a user's request can be processed, the specified information generated, and the output returned to the user. This is espe-

cially true in a real-time system, where the optimum time is measured in seconds. For a large-sized document collection, search-time — t h e time

spent scanning and correlating against the members of the collection — is critical, since it can become excessive, often varying with the size of the collection. Because of this, various techniques have been developed to shor"Batching", that is, searching the document collection only

ten search-time.

once for several queries, has proven effective in reducing per-query searchtime. This must be considered unworkable, however, in a real-time system, "Clustering" techniques, which use

when only single queries are available.

one "centroid" to represent many documents, also lower search time, and are, in addition, well suited for single-query real-time systems. Clustering

is the operation which consists in dividing a document space into several

XV-2

groups, each of which is considered as a unit.

Each cluster is represented On the sur-

by a centroid, similar in form to the documents it represents.

face, then, a collection of centroids is no different than a normal document collection. All clustering operations can be divided into two parts. The first

controls how the clusters are to be generated from the document collectior. and how centroids are to be assigned to the clusters. The second determines

a search-scheme by which the collection of centroids is scanned and certain clusters chosen for expansion. In addition, the final ranking of retrieved

documents and the subsequent use of relevance feedback techniques [2,8] may become part of a clustering system. These last considerations, however, are

not peculiar to clustering, and are not taken up in this report,

2.

Generating Clusters Several methods of generating document clusters are currently being;

sed in experimental systems, among which are those developed by Bonner, Rocchio, and Dattola. [3,4,5,6,7] Most of these make use of correlations

between the documents to be clustered, grouping those which correlate highest, and then forming each cluster centroid from the concept vectors of the documents included in that cluster. Thus, these techniques produce clusters of

documents whose concept vectors are highly related to each other, each cluster being represented by another vector which is a mathematical combination of the documents it represents. Parameters for these clustering routines

include the number of clusters desired, the number of loose documents permitted, the level of correlation between cluster members, and the degree of "overlap" of the clusters.

XV-3

In [1], V. R. Lesser suggests that a different clustering method be used. His method is a two-pass algorithm, consisting of the following steps:

First all queries previously processed by a system are clustered by a standard method. The resulting query-clusters are used to cluster the document

collection in one of three ways:

1.

All documents correlating highly with the centroid of a query cluster form a cluster.

2.

All documents correlating highly with one or more queries of any one query-cluster form a cluster.

3.

All documents judged relevant to one or more queries of any one query-cluster form a cluster.

The centroids of the resulting document clusters are the centroids of the corresponding query-clusters. In this way, each document is represented in Accor-

its cluster by a centroid formed from queries rather than documents.

ding to Lesser, this process is effective since incoming queries are more likely to be similar to past queries than to documents. Thus, Lesser be-

lieves, new queries are less likely to fall between query clusters than between document clusters. CSee Fig. l)

In addition to this property, the query clustering method, especcially when performed with relevance judgments, may enable a retrieval system to "mature" as more and more queries are entering Into the system. As in

all clustering schemes, updating the clusters would be periodical, depending on both the number of queries processed, and the number of new documents received.

Document Query Standard document cluster Query cluster with associated documents "New" query falling within query cluster, but between document clusters

Query-Document Space CFrom [ 1 ] \

Fig. 1

XV-5

3.

Searching Clustered Collections In general, collections of centroids can themselves be clustered to

form "super-centroids", etc.

With each new clustering, another "level" is Only simple centroids of level

added to the degree of the required search. two are considered here.

When searching such centroids, one parameter is crucial — the number of clusters to be expanded. Of course, numerous other considerations are also

important, including the method of determining the "goodness" of the centroids searched. These are, however, superseded in importance by the former, which If

controls the portion of the collection that is to be used in the search.

this portion is too large, the search is likely to be successful, but the resulting saving in search time may be Insignificant. On the other hand, taking

too small a piece of the document collection may produce poor, although rapidly obtained results.

M. -

Parameters for Evaluating Cluster Searches As in any search attempt, It is important to determine the recall and

precision of a clustered search.

However, other considerations also enter Perhaps the

the picture as full searches are replaced by centroid matches.

most important, and possibly the most difficult to measure, is the amount of savings in machine-time offered by the centroid search. All other values used

to decide the effectiveness of a search must be considered in combination with the statistics of how much time is saved. In this paper, no attempt is made Rather, all

to combine such time considerations with any other parameters. parameters are presented separately.

This is done because no acceptable method Indeed, the desired re-

of combining these parameters has been decided upon.

XV-6

suits may vary with the application:

given the decision of whether a search

retrieving 45% of all relevant documents while scanning 45% of a collection is better or worse than one retrieving 30% while using only 30% of the collection, different users would undoubtedly give different answers. The factor used in this paper to measure time savings is correlation percentage, the ratio of the number of documents and centroids scanned to the number of documents in the collection. This will, in most applications,

be a number between 0 and 1, with a full search always evaluated at 1. Given any particular query, it is reasonable to ask how different cluster-generating procedures rate as creators of good "targets" for a search. For example, a scheme generating clusters, none of which contain a large number of the relevant documents for that query, will yield poor results no matter what the search technique, because several clusters must then be expanded before all the relevant documents are retrieved, thus destroying th^ effectiveness of clustering. It is then necessary to examine the "target For a given query, the "target clus-

value" of the tested clustering schemes. ters" are those n

clusters which, between them, contain the largest number n is the number of clusters to be expanded. _

of relevant documents, where

Given two clusters with equal numbers of relevant documents, the smaller is chosen. When more than 1 cluster is to be expanded, the target clusters are

those which have the smallest total of Cdifferent) documents, while still containing the most relevant possible. The target value of a clustering

scheme for a particular query is the ratio of the number of relevant documents in the target clusters to the number of relevant for that query. The ideal system is one in which the target value for all queries is 1, and the correlation percentage is minimized. This alone, however, will

XV-7

not assure good results.

After ideal clusters have been formed for each query, The "aim" is then deThe

it is necessary that they correlate in the proper way.

fined as a measure of how well a centroid was assigned to each cluster. "aim clusters" of a given query are those ing a search. n

clusters which are expanded dur-

As with "target value", the "aim value" is the ratio of rele-

vant documents in the aim clusters to the total number of relevant for the query. This should not be confused with the "recall ceiling", a similar conCThe recall ceiling does not

cept, but one which yields different results.

take into account relevant documents dropped because they did not correlate highly enough with the query.) Although this paper does not deal with a wide enough range of experimental data to make full use of aim and target values, these concepts make it possible to separate judgments on clustering from those on centroid assignment, and may be valuable in an in-depth study of clustering techniques. Perfect values for aim and target should do much to optimize a search scheme, and when combined with low correlation percentages may be even more effective. One more consideration is important, however. Take, for example,

a collection of clusters, all of which contain all the relevant documents for a particular query. Another set of clusters may contain only one cluster inQuite conceivably, aim, target, and correlation

cluding all such documents.

percentage values may be identical for the two schemes on the given query, yet, the two schemes may be quite different. The former may have a great The term

deal of "wasted" documents where they are not needed by the query.

"rejection" is used to refer to the tendency of a clustering scheme to "reject" relevant documents from all but the target cluster(s) of a given query. It is

defined as the ratio of occurrences (not necessarily different) of relevant documents in the target clusters to occurrences (not necessarily different) of

XV-8

relevant documents throughout the clustered collection. 1 is optimal.

Again, a value of

5.

The Experiment Lesser1s attempt to demonstrate the effectiveness of query clustering

yielded encouraging results.

The limitations of the experiment, however, Since the most damaging of these

put the results on a less-than-solid basis.

limitations was the small size of the collection used, (only 35 queries and 8 2 documents), it was decided that an experiment on a larger collection was in order. In the present experiment, the Cranfield 424 collection, contairAs in LesserTs approach, the

ing 424 documents and 155 queries, is used.

procedure is in two phases — first query clusters are formed, and then document clusters are generated from these. Unlike Lesser, who associated docu-

ments in a cluster if they correlated highly with one or more queries in ar.y one query cluster, the current experiment uses relevance judgments to form document clusters. The 155 queries are split into two groups, one of 130 and one of 25, by choosing every sixth query for the smaller group. (This process is used

because the collection is arranged in order of subject area, so that taking, any continuous subgroup would destroy generality.) The 130 queries were

clustered using DattolaTs clustering algorithm [7], producing 11 clusters with an overlap of 13.9%. with an average of 28. Clusters range in size from 17 to 37 queries,

Queries in this collection have from 3 to 22 releDocument clusters are then formed by replacing

vant documents, averaging 6^.

the list of queries with a list of relevant documents for each cluster. Since this experiment is being done using Cornell University's SMART system,

XV-9

each centrold is easily associated with a different collection.

Both docu-

ments and queries are generally specified by a four-digit integer, and both have the same general appearance. It is thus possible to use documents and

queries interchangeably in almost all applications. The resulting document collection is described below (Table l ) . Overlap was not calculated for this collection, although it is estimated to be about twenty to thirty percent. Statistics are available giving the

number of times a document appears in a given number of collections Cfor example, only 102 out of 424 documents appear in exactly 1 cluster), from which the overlap is estimated. It is interesting to note that a collection of query clusters with an overlap of only 14% is turned into document clusters with an overlap nearly twice as high. The reasons for this include the fact that many docu-

ments are relevant to a great many queries, and that sets of co-relevant documents are common. In a clustering algorithm, the question of "loose documents'' must be considered. Loose documents are those which, at some point in the clusIf such documents are not "blended

tering procedure, belong to no cluster.

in" In one way or another, subsequent queries are likely to have artificially low recall ceilings. After associating all of the relevant documents

with the queries of the initial query collection, it is found that some 29 documents remain loose. Fifteen of these documents are found to be relevant

to one or more of the 25 test queries, so these documents can be blended in. This is done by correlating all 15 documents with all 11 centroids, and including each document in the two clusters, whose centroids are closest to the documents; in addition, each document is also included in any cluster with

XV-10

whose centroid it correlates by .1500 or higher.

The figures of two clus-

ters and .1500 are chosen to maintain the characteristic overlap of the collection at its original level, and are, for the most part, a product of intuition. Since a clustered-search is inherently different from a full search, it is desirable that other clustering methods be used for comparison. Thus,

the Cranfield 424 document collection was itself clustered using Dattola!s algorithm. The results of this operation appear in Table 1. Notice, in par-

ticular, the great difference in the number of concepts appearing in an average cluster for the two cluster schemes. This points up the fact that Dat-

tolaTs algorithm produces clusters with document-related centroids, while query-clustering techniques produce centroids resembling queries rather than documents. Four test searches are made, each with the same initial parameters: All documents correlating greater than 0 are considered; all other values are set at default conditions. One full search is done, one clustered search

using clusters generated by query-clustering, and two clustered searches using Dattola's algorithm to generate clusters. The first of these two calls,

for one cluster only to be expanded for each query, while the second calls for two. (A trial was made on which three clusters were to be expanded for

each query, but this run failed because insufficient space was available or the program disc storage unit.) Complete statistics are available Conclu-

ding aim, target, and rejection values — see Appendix A — where applicable) for the full search, query-clustered search, and the first of the two normally-clustered searches. Statistics for the remaining clustered-search are (See Table 2.)

limited to recall and precision values.

XV-11

\ v

Clustering ^ S v \Method

j Parameter ^\^^ ^ . ^
Number of clusters Number of documents In largest cluster Percent of collection in largest cluster Number of documents in smallest cluster Percent of collection in smallest cluster Number of documents in average cluster Percent of collection in average cluster Percent overlap of clusters Number of concepts in average clusters (

Document Clusters Generated by Dattolars Algorithm

Document Clusters Generated using Query-Clusters and Relevance Judgments

21 124

11 160

29

38

25

52

6

12

81

119

19.

28

18.5

Csee text)

374

127

j

Statistics of Clustered Collections

Table 1

XV-12

6.

Results As the graph In Appendix B indicates, the query-clustered search re-

sults in recall/precision values rivalling a full search, and surpassing it at one point, up to a recall level of .4000. The search with normal clusters

setting n=2 passes the query-cluster graph at recall .3000 and remains close to the full search graph from that point on. The standard clusters with n=l

generate values quite a bit lower than the others. A preliminary observation is that these results follow directly the correlation percentages of Table 2: on the Appendix B graph. The higher the CP, the better the resalts

Of course, this relationship is not linear, as the

full search is only slightly better than both the query-cluster search and the standard-cluster search with n=2; the full search has however a CP nearly three times the size of the others. here. It is suggested that, with "good" enough clusters and centroids, a clustered search need not loose a great deal of the recall compared with a full search. Notice in Appendix A that the n=l normal-cluster search has Thus, Obviously, other factors are involved

a very low aim value, completely cancelling out the high target value.

although for most queries there is a cluster which "suits" it very well, that cluster is seldom found in the search. tion of the centroid. The problem might be in the construc-

On the other hand, the query-clustered documents main-

tain both high aim and target values, and achieve markedly better results. Of course, these differences are not independent of the correlation percentage. Yet, it is a matter of question whether document clusters may be con-

structed with high aim and target values, and at the same time low correlation percentages. For several queries, it appears that normal document clus-

XV-13

^sParameter Search ^^\^^ Method ^\^^ Full Search Clustered Search using query clusters Clustered Search using Dattola's algorithm. n=l Clustered Search using Dattola's algorithm, u-2

Normalized Recall

Normalized Precision

Rank Recall

Log Precision

Average Correlation Percentage 100 31.2

0.8258 0.5538

0.5968 0.4500

0.1920 0.0621

0.4327 0.3328

0.3378

0.3040

0.0179

0.2712

21.1

0.6072

0.4893

0.1034

0.3665

38.8

Results of Four Searches n = number of clusters expanded

Table 2

tering is inferior to query-clustering, even with similar correlation percentages. (Queries 6,8,20,24,25.) On the other hand, other queries show the (Queries 7,9,22.) Additional results are needed, particu-

opposite trend.

larly of query-clustering methods generating relatively small clusters. Until such tests are carried out, the present results must remain inconclusive.

XV-15

References

[1]

V. R. Lesser, A Modified Two-Level Search Algorithm Using Request Clustering, Report No. ISR-11 to the National Science Foundation, Section VII, Department of Computer Science, Cornell University, June 1966. W. Riddle, T. Horwitz, and R. Dietz, Relevance Feedback in an Information Retrieval System, Report No. ISR-11 to the National Science Foundation, Section VI, Department of Computer Science, Cornell University, June 1966. J. D. Broffitt, H. L. Morgan, and J. V. Soden, On Some Clustering Techniques for Information Retrieval, Report No. ISR-11 to the National Science Foundation, Section IX, Department of Computer Science, Cornell University, June 1966. J. J. Rocchio, Jr., Document Retrieval Systems —Optimization and Evaluation, Report No. ISR-10 to the National Science Foundation, Harvard University Doctoral Thesis, March 1966. G. Salton, Search Strategy and the Optimization of Retrieval Effectiveness, Report No. ISR-12 to the National Science Foundation, Section V, Department of Computer Science, Cornell University, June 1967. R. T. Grauer and M. Messier, An Evaluation of RocchioTs Clustering Algorithm, Report NO. ISR-12 to the National Science Foundation, Section VI, Department of Computer Science, Cornell University, June 1967. R. T. Dattola, A Fast Algorithm for Automatic Classification, Report No. ISR-14 to the National Science Foundation, Section V, Department of Computer Science, Cornell University, October 1968. E. Ide, New Experiments in Relevance Feedback, Report No. ISR-iu to the National Science Foundation, Section VIII, Department of Computer Science, Cornell University, October 1968.

[2]

[3]

[4]

[5]

[6]

[7]

[8]

XV-17

Appendix A Aim, Target, and Rejection Values, by Query

XV-18

C

o I -H

"^
<->

1 1 i

L° ^D rorH O

O)
(I)

CD

CO
(1 i

CO t CN
C )

CN CO
CN
r-H

K

>

O CD - 1 C)

o o o

CN C )

CN LO CO CN O

- 1

4"
CJ) ( M

o o
LO ( J C )

- 1

O

C >

l> LO CO

o
LO CN

o
C J

o
r \

o o o
LO

Rej

CD

i

o

1 )

o

o

o

a

i |

O P>

-P (D b0 rd

0

i
1

e p.
-H

•H
•1
;

rd
P<

o o o o
H

o o o
.

C > C )

C )

C ) LO
[>

C 5

o

O

o
1

o o
H

o

o o
LO [> C 5

LO CN

o o

o o o

CO

o
CO CD
c 5 r I

o
CD
LO

c :>
r 1

o
C )

c >

()

O
C )

ro o o o
CD

o a
C )

o o o o
rH

I <

H

rH

o o o o
LO i

o o o

1

CD

O

o
C >

c
O O O rH

3

1 6
i

, < >
00 C D P 00 rH

«H

rd

H

o o o
rH

o rC i

o o o o
-

O

c >
CO

C ) CO

CO

o

LO CN

<::;:
r I

o

o o o

CD CD

c >

CO 00

o o o
CJ

o o o

O
rH

()
CN J" - 1 CO

o
O)

c >

O

o

k

i I

1 C D 0 P. - H P. P>

—r G c
(l.)

• ; *

o
p
•

3

O

a) o rd PL,
H

C D bO rd
^

^
CN

CO 5 t

CO

3-

CD

CD
C )

CD

CD

CD

CD O zt

en

o t

.\ CO

CN

a>

o
CO

o o

CO

* CO

O J"

CD D

•H

<
Lp

-p

a

•r 1

No. o Relev Docs. Clust

rd

P. CD

CO

O

o
. -1

CO H

CO

H

LO

CO

LO

(0

O

zt

00

p C D C D bO D" P. r H rd rd H >

o o o o
H
i

o

C )

o

c . »

[

-

CD

O

t>

o

c ) rH

o

o o
H

CO CO 'O

()
c :>
< i)

o

o

C)

o
C D C ) c )

o• O

o o
H

CD CO
c )

CO CO CO 00 CD

o a

00

'>

o o o
O . H

O

o a a
rH

^ C D
• ^

o

')
CD

C )

H

a
H
LO
CD

W

o

00

3 H O
• ^

1 C D P. P. O O

C O -H P rd H

G

:1
r- I

<)

a)

•% C D
b0 rd -P

CM

CO
C >

CO

CN
CD

:.t
LO

cr>
J-

CO
C 1

.;{"

CD

o
. -1
CN

CO

(D

P

C D b0 rd H

a) a*

cn
CN

O

t

CN

r co

co

co

o

rH CO

r
CO

o
zt

>
,r.
p
ci) 00 Q)

^
4H

No. o Relev Docs. Clust

P G G •. i rd

P C D

cp CO
[>

o
1 1

O r i

i

N-

:i

LO

C 'J

LO

1-

(O

zt

O

P

P.

rd

CO

o
. 1

O
r 1

o
H

CO

LO

-:t

LO

CO

(D

LO

CD

zt

P.

Quer Numb

^D C D

H

CN

CO

1

\t)

CO

\>

'()

CD

o

CN

H

H

r "I

CO rH

CO

text: for e; xplana' tion o
a)
CD

Numbe Relev Docum ents

XV-1 ( J

ion

CD

00 CO CM

C )

[

CD CM lO CO C )

(

I
I

4-'

I

00
CM

o

()
c: »

c >

CM
I

c ) C )

CD # l—)

It

(> LO

- •

(SJ

o
LO

o o o
CD

. t m CO CO C ;

o
lO '. 1 CD

t C-r 1 CSJ

o

< >

H

(O CO <u CO

cn
[ r\j

CD LO

CD P^

o

O

(»

o

O

o

()

o

o

O •P

-P (D

O

bO T H

CO i_n

c*o
00 00 00

o
c )
C 3

O

o
C ) c > LO

C )

C 3

C > C >

c >

C )

e u -P
1 <
•H

Li")
LO

h

fU

(^

rd

o
1 1

o o o
C -)

< ) (.) o
c >

o
o

o
H

o
( 3

o

O
CN C )

o
C ) -i

o o
o

()

o o o o
1 -1

CO

cn
[CD

1 °
1 |
<£> LO
LO LO

o
00 Cr ) 00 00

()
00

H

o o o
LO

. i

o o o
CM CD

3
i <

C D

o
C ') C )

C D c :>

(0
CM t

i e H
> 1 1 i C D O k *H ^ -P C D O ^
00 CD

o
. 1

o

o

o o o

o o
C )

o
C)
U )

< )
O
c )

o
C )

< -)
c )

! •
C )

o
rH

CM o

o o

o
C ) . 1

o a
CO

CM O

o
rH

o

o

•P
00

_.,
*CD

o
H
CN

(})

•

!

cn
o
CO

(D

o
C M

rH -] CM

0 LO

o
CM

O)

cn
-i - H

o
- 1
CM

CM rH CO

3
rH

O O

bO rd C D fU H PM P

C ) 00

=t

( \|

o
-f

H

co

H

~t
CO

o

<

of evant s . in Clu:
P.
CD -P r/i

LO

c >
r 1

(D

o

CO

o

CO

• O ^

rH C D D^

O O Q

t < 1

H

C J

r -

CM

H

1 1
o
CM

•P

C C D D bO 3
00 p CD -P

o o o o
H
1

o
C }

C )

c .:>

1 i
[> LO 00

o
C )

C)

C >

O
• )

o
LO CM (v)

o

O

o o
H

O O

o
H

o o <o o

o o
-1

o
LO

c >

r>
o

o

< ) C) (:>
H

< ) < >
c > - i

o o
o
- -1

cn
CO

E> l

>
C C
C D O

o

. I

o

o
.

1

W C D
:i i
fii

.

o o 3
rH

C D O (n - H

get

°
^

P. -P U O n) CD 03 O H P_, -P

^ * '& bO

zf H
CO

cn
O

- 1

CJ>

C M
( ?
t

-]-'
LO CO

rH

t LO CO

H
zf
CM

C ) H ( 1

cn
r

CM

rH

H

t
C vl

t CO

d"
CM

-.!" \

CM

< •,»

H
CO

>
0)
to
CD
. i :

I

o

>
0

CD

(\l

(I)

CO

CD

CO

C)

zt
r- 1

m

if )

[>

( M

C O

00

,

H

H

• H O 3 O C O H D S P4 Q O

1

O

-P

-P C D S 3 O O Q
CD CM CD LO [> CO

Quer Numb

P. >^ C D

. \

-\

LO • 1

CD , 1

O
• I

00

CD

o
C J

r 1

- 1

rH C s|

CN CM

CO CN

tC'-J

LO CM

p CD

'ages > <

CO

text for
0)
<})

P rti C > D XI C D B r-\ 3 0 S P^

J-

H

J - 1

LO

a)

r-

C ^

- 1

1

xplar tion c
Hi

1 H

i

nJ

,

-P C C -H MH fd

p

C D • -P

P

1i n

(\)

XV-20

•H -H O
CD
•i—i

o

rCD CO rH

<D (I) CN 1 O

c )
I D

CO t

( 0
CO

cn

cn
CO

t

( J rH

. i . i
i •

r- •
rH

o
CO . 1

.\
o

CO CO 00 CO CI)

o o
C ) CN

O
I D

CO

O

cn
CO 00

CN to

o o o

CO CO CO CO

. t-

CD P^

o

<'

()

( >

o

< )

o

O P •H <

-P 0) 0 bO • - 1
}'

O O O O

rH i 1 - 1 . 4

O

e 03 ^
H

rd
(.1-;

o o

C ) If)

<)
1 1

r00

C 3 ( M

()

C) C)

c >

o
IX) C )

o
1 )

o
H

o o o o o o o

<>

o o o
C ) . 1

o o cn o
r 1

o o o o
H

o o

CD

LO

o o o o o

o
C D 3 •H

( »
C J C 5

<'
r>
CO

< >

e rd H
i

o o o o o

c >
O O r 1 3 C )

c :>
K )

c :>

<o
. -1

o o
< 1 -1

o o
LO [ -

CO C0 0 1 CO

CO

o
C .)

cn
CO 00

CO CO CO CO

(')
i >

C )

o o
H

o o o o o

o

()
r• 1

o

o

<:.)
r-LO CO J

o
rin CO

o

o

!

c/]

\

0 •P 3 H

^

CO

i C C O D ^ -H ^ P O rd O H

r. < D
a n

• ; $

0) bO 0) rd P-H P

m »
CO rH

O t - 1

[> cj) CN

t CO

<D

rC

r
zJr~\

o
rH

Li") [> rH

LO [> rH

c -J
CO

.\

en

d-

•H

<
4n O

Nc . Rele Docs Clus

P C rd >

P.
CD

P

o

H

ID

CO

H

CM

CO

o

rH

LO

LO

'

i

o

r
P
03 CD

w
CD

bO 3 k rH rd rd H >
i

o o o o
rH

O

O C )

o

o

en

o

LI )

o o o
00

CO CO CO 00

o
1 )

o o

o
00 ' J

LO

< )

o o
C") 00

CO CO CO CO C )

CO CO CO (X) C )

o o o

C )

r-

<D CD

<o
O

o o o o

LO

k

- >
00

c >

')
CM 1

()
<D

<)
<D
O)

«>
r 1 [--

1 1

I

,

H w 3
rH

, ,

1 < D
^

C O
-H

C
(i)
• ; *

(/) 0) :y- 1
rrj

O

+1 "
bO rd E-

U O
O

C U & bO a) rd r H 04 P P rd

a

c*1
00 rH

r>
c si

00 00 CM

en
H

r>
t

i-

[^ LO CD

CO

o
' J

cn

r 0

cn

-1

r 1

o

H

H

!

CM

=t

>
(1)

C O

a)
, ' • :

^
i
4H

P

G
rd

r:
•H

I >

No. o Relev Docs . Clust

i

°°

en

LO

CO

LO

--J

CO

]

rH

LO

1! )

:t

CM

4H

i

O

P

c

,

U

rd

1

00

o

C i 1 1

o H

<D

LO

1-

LO

CO

(D

LO

CD

d-

Quer Numb

i

>.

(i)

I

rH

CM

co

- -1-

in

ID

[>

(J)

cn

o
H

. 1

CM H

co
rH
(1) CO

xt for explana tion o
a)

& CD

M |

Numbe Relev Docum ents

\ XV-21

£ o
1 1 1
i

«H -P C D

01
CM
LO

H"

C ) ( i C )

t cO CO rH

CM
r l C J , -1

-0 CO CO
r 1

LO f -

LO

00 . 1

[> LO 00
CO

rp 00 r

C )

C 0

O
if)

o
C )

O to

CO
C Sj CO . ]

:|I

CM
c >

CM

.1 1
C )

-1 (O (D CO

>" ' ' —
CD

o

o

O

CO

o

()

O

o

o

()

()

!

o
P

p
CD bO O «H

e ^ -p
•H < rd H rd (^

o o o o

(')
C i C i

[

C )

UD
CO CO

LO
(-J

o

C)

c >

o
r 1

o o o
c )

<D

(00
CX)

C '

O
CO) ID

CM 1

o o o
r 1

CO CO
O) CO

o
c )

CO)

(C)

o
C3 rH

c ) c )

d"

LO

LO LO O

CM
1 r r
\

I °
1

O

CO

r 1

')
C )

CO

o

1 1

C D

3
•H

e H rd < >
1 G (D O
P, - H
P. O P rd H

1 I

o

°

°

o o
C (D ^
O
U CD
P-4

CO CO 00 LO C)

CO CO CO CO

CO

o
CO

CD 00 CM 1

CO
C ) C D

CO LO CM

o

CO
-

)•

C ) c ) c ) 00

o o
If) ( J

CO
:t r 1 1

O

— —— — - . LO
oo
U3 CO
O

C D
•• 1

o
If)

r-

\

^
ci)

CM
C >

[ • c >

(y

o

o o
CO

o
oLO CO

o o
LO en

o

C)

c >

c>

00

i

^ C D
1 -ti
1

en o
rH

LO
r ••

LI ) L>

i.O

O) O
r 1

o:
LO
[-•

B
, I II
n )

01 CO

LO P . 1

Zt CO

rH rH CN

C D
W) dj P rH

i ' •r !

C/D

1
H

3 O

. 1

O

. \

00 CM

. 1

. 1

CM

u 0
CO

1

<

P> C <P rd

£ «H

p. CD

No . o Relev Docs. Clust

o

P

CM

H

-:> ,

C )

rH

f O

zt

CM

LO

0)

1

y Datto

1

td

p CD P.

CO CD rH

O

CO CO CO

bO 3
00
CD -P

rd rd F-> >

o o o
rH

LO

o o o

c:.)
t D

CO
00 c )

LO c >

LO 00 CM M-

O

C )

c )

C ) C ) 00 c )

o o

()
O
1C)

c >
. I

o o
Lf)

')

( .) ; LO

o

CO

o. 1 [>

LO

< ) o

[>
rH CM

. i c )

•a
G)

ro

ro-

^ M »
00

'I I O

^
I
P. O

o

o

o
H
l>

o
LO OO
c M

o

•
r

I '
<7)

W

C
P rd

C
CD -JC Q CD P. CD b0 rd

i

CD

H
\

C D CO

i 3 1 H o

CD O P, - H

rLO 00

If)

<o
CN CO

o
H
CM

o
r 1 ' J

co
CN CM

CO CO CO

LO

LO

t

JCM CN

d
. 1

c )
00
1 >

rd r> -1 rH
CO
( si

!

"

•

hes

i

^

bO

rd H

^

p
Lp O • O ^

C -H
• 00 O O Q

n

p.
CD P 00 3 rH O

i >

CD

r-

CO

t

CO

CO

co

[>

d-

' i)

LO

(0)

1

t for explan ation o

rd > CD rH CD p^

1

C i

r

-^ — /

Numbe r of Relev ant Docum ents

en

lO

LO

[>

(O

|-

H -1

LO

CO

r-

C s|

. 1

1
00 CD bO

X
CD -P
ci)

>o P. CD

CD rQ £ 1 1 i

H

-1

• 1

rH

1 1

CO

CM

CM

CM

Avera

LO

- 0

[--

00 rH

(-O

CO

. 1

CM CM

en CM

o

if)

cy

^

<\)
CO

ume

! +- 1

r 1

>
C D

a H ^

o

H

a, -p

n

o o
i

sters exp<
M

I B i *H

<

r 1

C D
n ) ni

LD T

(:

!-.

i

XV-.1.

Appendix B

PRECISION 0.7000 0 . 6 0 0 0 -A 0.50000.4000H ^ 0.3000^ 0.2000H
o. IOOOH
^

A = Full Search • = Clustered Search using Query Clusters O = Clustered Search using Dattola's Algorithm (n=l) • = Clustered Search using Dattola's Algorithm (n=l)

\

.

N*

(n is the number of clusters expanded for each query) -A

I

0.00000.0

± _ L _ L ± RECALL T T T 0.2000 0.4000 0.6000 0.8000 1.0000 0.1000 0.3000 0.5000 0.7000 0.9000 T Recall-Level Averages for Clustered Search