XIV-1

The Single Pass Clustering Method

S. Rieber and V. P. Marathe

Abstract

In information retrieval, several complex clustering methods exist which require extensive processing time and computer memory. A cheap clustering method is developed which requires only one pass over the document collection to generate clusters. The one-pass clustering method is investigated using the ADI collection of 82 documents and 35 queries which is available on-line in the SMART system. Clusters formed are not of uniform size; one or two early clusters are exceptionally large. Variation of the minimum correlation cutoff value is an adequate control for the number of clusters generated. The effect of document order on the clustering method is investigated, and the results are inconclusive. Overall, the single-pass clustering method is surprisingly effective and compares favorably with more complicated clustering methods.

1. Introduction

In information retrieval, a given item of information is often represented by an N-dimensional vector, each dimension referring to a different property of the item. Using matrix algebra, comparisons, classifications, and other processes can be performed on the information vector or on a long series of such vectors. Classifying the item, that is, placing it in a group of similar items, can save time later when it becomes necessary to refer to that item again. For example, at the supermarket it is much easier to find the aisle with breakfast cereals, and then to look for the corn flakes, than it is to search for them on every shelf of every aisle.

The classification problem has two basic aspects: 1) a classification already exists, and each new item is placed in that group to which it is the most similar; 2) no classification is assumed to exist, and groups are formed on the basis of similarities between items. [1]

Both types of processes have advantages and disadvantages. If a classification already exists, the processing of new items is simple. However, the size of the groupings often becomes unbalanced in time, with a resultant loss in efficiency. Merely finding the breakfast food section won't save much time if half the store is stocked with cereals. The second classification procedure, that of regrouping every time a few new items are added to the set, is time-consuming. Many comparisons and decisions must be made to determine the optimum groups. It may be reasonable to include the same item in more than one group. But once the time and trouble have been taken to classify the items, any specific item is relatively easy to find.

In the context of automatic information retrieval, the items of information are documents and search requests, and the classification groups are called clusters. All the vectors in a cluster are averaged to obtain the cluster centroid, a vector representation of the entire group. Query vectors are correlated with the centroids. Only those centroids with sufficiently high correlations are expanded; that is, each document in the cluster is correlated with the query and retrieved in order of decreasing correlation. This type of search is known as a cluster search or two-level search. In contrast, a full search would correlate every document with the query and retrieve the documents in order of decreasing correlation. Cluster searching is a proven method for reducing search times, hopefully without significant loss of relevant documents.

Many document grouping schemes representing both solutions are in existence. For example, library cataloging sorts publications into already existing classifications. On the other hand, much research is being done to implement the second classification process, that of forming clusters based only on the similarities between the items. In general these methods require extensive computer time and storage space to calculate correlations between every pair of documents.
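The two-level cluster search described above is easy to sketch. The following Python fragment is illustrative only (the toy vector representation, function names, and cutoff are assumptions, not part of the SMART implementation): the first level correlates the query with each centroid, and only sufficiently correlated clusters are expanded at the second level.

```python
import math

def cosine(u, v):
    # cosine correlation between two term-weight vectors
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def cluster_search(query, clusters, centroid_cutoff):
    # first level: correlate the query with each cluster centroid;
    # second level: expand only sufficiently correlated clusters and
    # rank their documents by decreasing correlation with the query
    hits = []
    for centroid, docs in clusters:
        if cosine(query, centroid) >= centroid_cutoff:
            hits.extend((doc_id, cosine(query, vec)) for doc_id, vec in docs)
    return sorted(hits, key=lambda h: -h[1])

def full_search(query, clusters):
    # baseline: correlate the query with every document in the collection
    return sorted(((doc_id, cosine(query, vec))
                   for _, docs in clusters for doc_id, vec in docs),
                  key=lambda h: -h[1])
```

The cluster search correlates the query only with documents inside expanded clusters, which is where its time saving over a full search comes from.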
Rocchio's and Bonner's clustering methods [3], [2] are of this kind, since they require a processing time proportional to N², where N is the number of documents in the collection. A more efficient clustering method developed by Dattola [1] does not compute the document-document correlation matrix and reduces the processing time to one of order N log N. A different, highly inexpensive method of clustering has been proposed: that of examining each document only once, and forming clusters in the process. Such a method is called the single-pass clustering method and is the subject of this paper.

The single-pass clustering method is appealing because it saves a significant amount of time over other clustering methods. To illustrate, assume that a clustering method requires 1 minute of computer time for 100 documents. Then to cluster 1,000,000 documents, an N² clustering method will take 170 years, an N log N method 21 days, while the single-pass method would take only 7 days.

The single-pass clustering method also has a certain aesthetic appeal in that it represents a compromise between the two basic aspects of the classification problem. No initial classification exists, but as documents are processed, classifications are built up. A document is considered for inclusion into all existing clusters before it is allowed to start a cluster of its own. Classification groups are formed principally on the basis of the similarities of an item with an existing group rather than on the basis of similarities between items. However, once a document is accepted in an existing cluster, the cluster centroid is accordingly revised.

2. The Program

A FORTRAN program was written to implement the single-pass clustering method. A flow diagram of the process is shown in Fig. 1. The single-pass clustering method assigns the first document vector scanned as a cluster centroid.
The next and succeeding documents are correlated with the existing cluster centroids. If a minimum cutoff correlation is equaled or surpassed, the document is included in every such cluster and the cluster centroids are revised (overlapping), or the document is included only in that cluster with the highest correlation, assuming the minimum cutoff is again equaled or surpassed (disjoint). If the minimum correlation is not reached, a new cluster is formed with the document as centroid.

Two correlation methods — cosine and Tanimoto — can be used with the program. Cosine correlation normalizes all vectors to length 1.0 in N-dimensional space, N being the number of concept-dimensions in every document or query vector. The cosine of the angle between the two vectors is computed and used as a similarity measure:

    \text{Cosine } S(\bar{d}_1, \bar{d}_2) = \frac{\sum_{i=1}^{N} d_{1i}\, d_{2i}}{\sqrt{\sum_{i=1}^{N} d_{1i}^2}\; \sqrt{\sum_{i=1}^{N} d_{2i}^2}}

The flow diagram of the program is as follows:

1. Control cards initialize the type of correlation (cosine, Tanimoto, distance), the type of clustering (disjoint, overlapping), the minimum cutoff correlation, the document order, etc.
2. The first document is read and becomes the centroid of the first cluster.
3. The next document is read and correlated with the existing cluster centroids.
4. If no correlation exceeds the minimum cutoff correlation, the document becomes the centroid of a new cluster.
5. Otherwise, for overlapping clustering the document is included in all clusters with correlations above the minimum cutoff and those cluster centroids are updated; for disjoint clustering the document is included in the cluster with the highest correlation and that cluster centroid is updated.
6. Steps 3-5 are repeated until all documents have been processed.
7. Information for the clusters formed is printed. The clusters, stored in a suitable format, are fed into the SMART search and average programs to obtain measures such as precision and recall.

Single Pass Clustering Method
Fig. 1

Tanimoto's correlation also performs vector multiplication to obtain a similarity measure. The denominator is a normalizing term to prevent dense document vectors from being assigned disproportionately high correlations:

    \text{Tanimoto } S(\bar{d}_1, \bar{d}_2) = \frac{\sum_{i=1}^{N} d_{1i}\, d_{2i}}{\sum_{i=1}^{N} d_{1i}^2 + \sum_{i=1}^{N} d_{2i}^2 - \sum_{i=1}^{N} d_{1i}\, d_{2i}}

The program control parameters are as follows:

1. Type of clustering — overlapping or disjoint.
2. Correlation method — cosine or Tanimoto.
3. Minimum concept weight for centroid — to prevent the dilution of the cluster centroid, concept weights of less than this value are set to zero. All runs were made at .005.
4. Concept bound — the highest concept number of the document collection.
5. MULT — concept weights of the normalized centroid vector in the single-pass program are less than 1.0. To prevent most of these concepts from disappearing when converting to integer (fixed point) values for the SMART system, all centroid weights are multiplied by MULT. All runs were made with MULT equal to 100.
6. Cutoff — the value of the minimum correlation cutoff, below which a document will not be included in a cluster.

Six control cards cause the above quantities to be introduced into the program, and also determine whether the output will be printed or punched or both. Four of the cards are name cards and place the proper SMART control cards at the beginning of the output to facilitate SMART evaluation.

3. Investigation and Results

Using the 82 document, 35 query American Documentation Institute collection filed on-line in the SMART CDS (Cornell Data Set), the following investigations of the single-pass clustering method were undertaken:

1) Correlation comparison: cosine and Tanimoto correlations, forward order — comparison of the clusters formed by the two different correlation methods over a range of correlation cutoff values;

2) Disjoint-overlapping comparison: cosine correlation, forward order — comparison of disjoint and overlapping clusters over a range of correlation cutoff values;

3)
Variation of document order: cosine correlation, fixed correlation cutoff — comparison of both overlapping and disjoint clusters formed with three different initial document orders;

4) SMART evaluation: SMART evaluation of the 6 sets of clusters formed in part 3 to compare overlapping and disjoint methods as well as the three document orders; a) full search of the document collection and evaluation, for comparison to single-pass runs; b) SMART evaluation of clusters formed by Dattola's method, for comparison; c) SMART evaluation of Dattola's clusters formed from initial single-pass clusters.

A) Correlation Comparison

For cosine correlation, correlation cutoff values ranging from .10 to .30 were investigated. The number of clusters formed varied from 5 at .10 to 31 at .30, with 14 clusters formed at a correlation cutoff of .20. The Tanimoto correlation cutoff was varied from .02 to .30. The number of clusters formed varied from 7 at .02 to 81 at .30, with 15 clusters formed at a cutoff of .04. All clustering was disjoint, using forward document order. Appendix 1 contains tables and graphs of the results of all runs performed for this phase of the investigation.

For every correlation method, the relationship between the number of clusters generated and the correlation cutoff value is smooth and not unmanageably steep. Suitable variation of the correlation cutoff value can therefore be used to roughly control the number of clusters that will be formed for a given document order.

A striking fact about cosine and Tanimoto clusters is the large variation between sizes of clusters formed, as shown in Fig. 2. The largest clusters are usually among the first 2 or 3 clusters, no doubt because they are formed early and have a chance to inspect nearly the entire document collection. Tanimoto clusters seem to be slightly more evenly distributed than cosine clusters.
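The clustering procedure of Fig. 1, with both correlation measures, can be sketched compactly. The following Python sketch is illustrative rather than the report's FORTRAN program: it assumes dense list vectors, updates each centroid as the running mean of its members (the report states that cluster centroids are averages of the member vectors), and omits the concept-weight threshold and MULT scaling.

```python
import math

def cosine(d1, d2):
    # cosine correlation between two concept-weight vectors
    num = sum(a * b for a, b in zip(d1, d2))
    den = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    return num / den if den else 0.0

def tanimoto(d1, d2):
    # Tanimoto correlation: dot product normalized by the "union" weight
    dot = sum(a * b for a, b in zip(d1, d2))
    den = sum(a * a for a in d1) + sum(b * b for b in d2) - dot
    return dot / den if den else 0.0

def single_pass(docs, cutoff, sim=cosine, overlapping=False):
    """Each cluster is (centroid, member_indices). The first document seeds
    the first cluster; each later document either joins one cluster
    (disjoint), joins every qualifying cluster (overlapping), or seeds a
    new cluster when no correlation reaches the cutoff."""
    clusters = []
    for i, doc in enumerate(docs):
        scores = [(sim(doc, c), k) for k, (c, _) in enumerate(clusters)]
        matches = [(s, k) for s, k in scores if s >= cutoff]
        if not matches:
            clusters.append((list(doc), [i]))      # new cluster, doc as centroid
            continue
        targets = [k for _, k in matches] if overlapping else [max(matches)[1]]
        for k in targets:
            centroid, members = clusters[k]
            n = len(members)
            # revise the centroid as the mean of all member vectors
            clusters[k] = ([(c * n + x) / (n + 1) for c, x in zip(centroid, doc)],
                           members + [i])
    return clusters
```

Note that the result depends on the order in which `docs` is scanned, which is exactly the order-dependence investigated in part 3.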
There is no apparent relationship between documents included in a given cosine cluster and documents included in a given Tanimoto cluster.

Cosine, .20 correlation cutoff (14 total clusters, 5.9 average documents per cluster):

  Documents in cluster   Number of clusters
  20 or 21               2
  5, 6, or 7             4
  3 or 4                 2
  1 or 2                 6

Tanimoto, .04 correlation cutoff (15 total clusters, 5.5 average documents per cluster):

  Documents in cluster   Number of clusters
  23                     1
  8, 9, 10, or 11        3
  5, 6, or 7             1
  3 or 4                 5
  1 or 2                 5

Cluster Size Variations
Fig. 2

B) Disjoint-Overlapping Comparison

For cosine correlation and forward document order, the correlation cutoff was varied from .10 to .30, and both overlapping and disjoint clusters were formed. At a cutoff of .20, 14 disjoint and 17 overlapping clusters were formed. The average document is included in only one cluster for the disjoint method but is included in 4.8 clusters for the overlapping method. Apparently the dilution of the overlapping clusters resulted in the formation of three more clusters than the disjoint method. Any relationship between the document composition of a given overlap cluster and that of a given disjoint cluster is submerged by the large amount of overlap. The results of the SMART evaluation runs made on disjoint and overlapping clusters will be discussed in that section. Tables and graphs for all runs appear in Appendix 1.

C) Variation of Document Order

Three different initial document orders were used for this investigation — forward, reverse, and middle. The forward document order lists the documents sequentially, while the reverse order has them backwards. Middle order introduces the documents in an inside-out order; for example, documents 1, 2, 3, 4, 5, 6 in middle order would be 4, 3, 5, 2, 6, 1. Using a fixed cosine correlation cutoff of .20, both disjoint and overlapping clusters were formed with the three different initial document orders.
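The three orderings are simple to generate. This sketch (function names assumed for illustration) reproduces the 4, 3, 5, 2, 6, 1 example for the middle order by interleaving outward from the centre of the list.

```python
def forward(docs):
    # documents in their sequential order
    return list(docs)

def reverse(docs):
    # documents backwards
    return list(docs)[::-1]

def middle(docs):
    # inside-out order: middle element first, then alternate outward,
    # so [1, 2, 3, 4, 5, 6] becomes [4, 3, 5, 2, 6, 1]
    docs = list(docs)
    right = docs[len(docs) // 2:]        # middle element and rightward
    left = docs[:len(docs) // 2][::-1]   # just left of middle, leftward
    out = []
    for i in range(max(len(left), len(right))):
        if i < len(right):
            out.append(right[i])
        if i < len(left):
            out.append(left[i])
    return out
```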
Fig. 3 shows the number of clusters formed for each situation. From this data, it can be tentatively concluded that a 10% or 20% variation in the number of clusters formed from one initial document order to another will not be unusual.

  FOL (Forward Overlap)  - 17 clusters
  MOL (Middle Overlap)   - 14 clusters
  ROL (Reverse Overlap)  - 19 clusters
  FDJ (Forward Disjoint) - 14 clusters
  MDJ (Middle Disjoint)  - 14 clusters
  RDJ (Reverse Disjoint) - 19 clusters

  Average: 16.2 clusters; standard deviation: 2.07 clusters.

Variation in Cluster Size for Different Document Orders
Fig. 3

D) SMART Evaluation

Precision and recall data were averaged over all disjoint and all overlapping runs to obtain a measure of effectiveness reasonably independent of document order. Precision-recall graphs of the following runs are plotted in Appendix 2:

1. Full search
2. Dattola's cluster search
3. Single-pass disjoint cluster search
4. Single-pass overlapping cluster search

The full search curve is naturally more effective than the other curves. The single-pass disjoint cluster search is the next best search. Dattola's cluster search appears higher at both ends of the graph than the single-pass overlapping cluster search, but loses some ground in the middle to high-recall region. Fig. 4 shows for each run the index of rank recall plus log precision, another method of system evaluation, and reveals nearly the same results.

Using one set of disjoint and one set of overlapping clusters as initial clusters, Dattola's clustering method was evaluated. Precision-recall graphs of these results are found in Graph 2 of Appendix 2. The disjoint initial clusters seem to give a higher precision at low recall values but a lower precision at high recall values than the overlapping initial clusters. For comparison, Dattola's previous run with unknown initial clusters is also plotted and falls in the middle.
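The precision-recall curves used in these comparisons can be computed per query from a ranked retrieval list. The sketch below is a generic implementation; the SMART system's exact averaging and interpolation conventions are not reproduced here, and the function name is an assumption.

```python
def precision_at_recall_levels(ranked_ids, relevant, levels):
    """For one query: walk the ranked list, record (recall, precision)
    after each relevant document found, then report the interpolated
    precision (best precision at any recall >= r) at each level r."""
    relevant = set(relevant)
    found = 0
    points = []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return [max((p for rc, p in points if rc >= r), default=0.0)
            for r in levels]
```

Averaging these per-query values over all 35 queries at standard recall levels yields curves like those tabulated in Appendix 2.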
In Graphs 3 and 4 of Appendix 2, the precision-recall curves of the disjoint and overlapping clusters used as initial clusters for Dattola's method are plotted. Dattola's results using these clusters as initial clusters are also plotted, to determine whether any significant improvement in clustering has occurred. For the disjoint clusters no improvement occurs; Dattola's clustering slightly reduces the effectiveness of the initial clusters. For the overlapping clusters, Dattola's method increases precision at the low-recall end, a significant improvement, but reduces it at the high-recall end of the graph.

It is felt that the apparently high recall and precision results obtained by the single-pass method relative to Dattola's method are in part due to the expansion of one or two very large clusters, providing a search of over half the document collection. Four more runs were made to reduce the number of clusters expanded so that a more legitimate comparison between the two methods could be made. Although a reduction in the number of clusters expanded occurs, no appreciable effect is observed on the performance measures, probably due to the continued presence of those extremely large clusters.

  Search Type                  Rank Recall plus Log Precision
  Full                         .6323
  Average Disjoint Cluster     .5359
  Average Overlapping Cluster  .5086
  Dattola's Cluster            .4888

A Parameter for Comparing Various Methods
Fig. 4

For each run, the average documents checked per query are tabulated in Fig. 5. Most single-pass runs checked far more documents than Dattola's runs, revealing that a single-pass cluster search is not as efficient as a Dattola cluster search in terms of search time. Recall ceilings, however, naturally favor the system which looks at the most documents. Global comparisons of all evaluation runs are tabulated in Table 2 of Appendix 2. Normalized precision, normalized recall, rank recall, log precision, rank recall plus log precision, and recall ceiling are included.

4.
Conclusions

Over the 82 document, 35 query ADI collection, the single-pass clustering method as described here is effective. Because the document collection is so small, these promising results cannot be conclusive. More extensive evaluation runs with larger document collections should be conducted before allowing the single-pass clustering method a permanent place in the information retrieval world.

The wide gap in the number of clusters formed for different input orders contributes to a growing suspicion that the single-pass method may be very order-dependent. The three orders used were not truly random. The middle order, for example, is a composite of both the forward and reverse orders, as shown in Fig. 6. Further investigations of the single-pass clustering method should include evaluation with random initial document orders to determine more precisely the standard deviation of the number of clusters formed at a given correlation cutoff.

Disjoint clusters are more effective than overlapping clusters. Reducing the figure of 4.3, the number of overlapping clusters to which the average document belongs, would certainly improve the overlapping feature. It is recommended that the overlapping algorithm be revised so that a document is included in the cluster with which it correlates the highest (above cutoff) and in only one or two other clusters if the correlation is within the value of a small parameter, epsilon, of its highest correlation.

The number of clusters which will be formed can be effectively controlled by the establishment of a range of values for the correlation cutoff.

Perhaps the biggest pitfall of the single-pass method is the formation of one or two excessively large clusters. It is these large clusters which cause too many documents to be checked and reduce the efficiency of the cluster search.
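The epsilon revision of the overlapping algorithm recommended above can be sketched as a pure assignment rule. The function name is hypothetical, and the suggested cap of "one or two" extra clusters is left out for simplicity.

```python
def assign_overlapping(scores, cutoff, epsilon):
    """scores: list of (similarity, cluster_index) pairs for one document.
    The document joins its best-matching cluster if that similarity
    reaches the cutoff, plus any other cluster whose similarity is within
    epsilon of the best. Returns [] when the document must seed a new
    cluster."""
    eligible = [(s, k) for s, k in scores if s >= cutoff]
    if not eligible:
        return []
    best_s, _ = max(eligible)
    return [k for s, k in eligible if s >= best_s - epsilon]
```

A small epsilon keeps the overlapping behavior close to disjoint clustering, which is one way to reduce the 4.3 (or 4.8) clusters per document observed with the unrestricted overlapping rule.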
To prevent large clusters from forming, it is recommended that the single-pass clustering program be extended so that, as cluster size increases, the cosine correlation cutoff value slides up through a sequence of values. As the number of items in a cluster increases, it then becomes more difficult for a document to assimilate into that cluster. Such a program feature would reduce the large variation in the sizes of the clusters.

Finally, the possibility of more than one pass through the document collection to determine the optimum cosine correlation cutoff should be considered. For large document collections, a small representative subset could be reprocessed until an optimum correlation cutoff is determined.

  Run Description                  Average Documents Checked per Query   Recall Ceiling
  Dattola (MOL initial clusters)   27.3                                  .47
  Dattola (MDJ initial clusters)   23.8                                  .53
  RDJ (Reverse disjoint)           25.5                                  .58
  FDJ (Forward disjoint)           33.0                                  .66
  MDJ (Middle disjoint)            39.0                                  .66
  MOL (Middle overlapping)         45.6                                  .66

Comparison of Average Documents for Various Methods
Fig. 5

  Forward Order   Middle Order   Reverse Order
  1               4              6
  2               3              5
  3               5              4
  4               2              3
  5               6              2
  6               1              1

Relation in Three Document Orders Taken — An Illustration
Fig. 6

References

[1] Dattola, R. T., "A Fast Algorithm for Automatic Classification," Report ISR-14 to the National Science Foundation, Section V, Department of Computer Science, Cornell University, 1968.

[2] Bonner, R. E., "On Some Clustering Techniques," IBM Journal of Research and Development, Vol. 8, No. 1, January 1964.

[3] Rocchio, J. J., Jr., "Document Retrieval Systems — Optimization and Evaluation," Report ISR-10 to the National Science Foundation, Harvard Computation Laboratory, Cambridge, Massachusetts, March 1966.
Appendix 1

Cluster Size Information

  Cutoff   Number of Clusters   Average Size   Maximum Size   Minimum Size
  0.02     7                    11.7           35             1
  0.03     10                   8.2            24             2
  0.04     15                   5.5            23             1
  0.06     27                   3.0            18             1
  0.08     33                   2.5            17             1
  0.10     39                   2.1            14             1
  0.20     >50                  1.2            4              1
  0.30     >50                  1.0            1              1

Cluster Size Observations for Forward Document Order, Tanimoto Correlation and Disjoint Clusters
Table 1

[Graph 1: Number of clusters, maximum cluster size, and average cluster size plotted against cutoff, for forward document order, Tanimoto correlation, disjoint clusters.]

  Cutoff   Number of Clusters   Average Size   Maximum Size   Minimum Size
  0.10     5                    16.4           70             1
  0.15     9                    9.1            37             1
  0.20     14                   5.9            21             1
  0.25     23                   3.6            12             1
  0.30     31                   2.6            19             1

Cluster Size Observations for Forward Document Order, Cosine Correlation and Disjoint Clusters
Table 2

[Graph 2: Number of clusters, maximum cluster size, and average cluster size plotted against cutoff, for forward document order, cosine correlation, disjoint clusters.]

[Table 3: Cluster size observations for forward document order, cosine correlation and overlapping clusters, tabulating for cutoffs 0.10 through 0.30 the number of clusters, the average, maximum, and minimum cluster size, and the average, maximum, and minimum number of clusters a document belongs to; the entries are not recoverable from the scan.]

[Graph 3: Number of clusters, maximum cluster size, and average cluster size versus cutoff for the overlapping clusters of Table 3, with the disjoint curve from Graph 2 shown for comparison.]

[A further table compares, for the forward, reverse, and middle document orders and both cluster types (disjoint, overlapping), the number of clusters, the average, maximum, and minimum cluster size, and the number of clusters a document belongs to; the entries are not recoverable from the scan.]

Appendix 2

Clustering Evaluation

  Recall   Full     Dattola's   Dattola's,       Dattola's,       Single Pass,   Single Pass,
           Search   Clusters    MOL initial (1)  MDJ initial (2)  Average*       Average*
                                                                  Overlapping    Disjoint
  0.00     0.6356   0.6152      0.5468           0.6192           0.5999         0.6253
  0.05     0.6356   0.6152      0.5468           0.6192           0.5999         0.6253
  0.10     0.6356   0.6142      0.5468           0.6192           0.5981         0.6251
  0.15     0.6142   0.5900      0.5332           0.6008           0.5784         0.6036
  0.20     0.5690   0.5580      0.4955           0.5522           0.5269         0.5564
  0.25     0.5350   0.4957      0.4587           0.5051           0.4951         0.5268
  0.30     0.4939   0.4686      0.4213           0.4736           0.4555         0.4784
  0.35     0.4604   0.4192      0.3930           0.4361           0.3931         0.4438
  0.40     0.4582   0.4068      0.3870           0.4342           0.3864         0.4417
  0.45     0.4482   0.3952      0.3527           0.4085           0.3734         0.4248
  0.50     0.4458   0.3932      0.3508           0.4064           0.3725         0.4235
  0.55     0.3424   0.2235      0.2656           0.2825           0.2620         0.3063
  0.60     0.3399   0.2235      0.2644           0.2816           0.2611         0.3048
  0.65     0.3172   0.2144      0.2573           0.2652           0.2452         0.2869
  0.70     0.2453   0.1809      0.1819           0.1765           0.1854         0.2117
  0.75     0.2438   0.1809      0.1819           0.1765           0.1839         0.2117
  0.80     0.2342   0.1748      0.1726           0.1576           0.1734         0.2064
  0.85     0.2173   0.1585      0.1600           0.1445           0.1567         0.1903
  0.90     0.2053   0.1585      0.1572           0.1208           0.1471         0.1773
  0.95     0.2053   0.1585      0.1572           0.1208           0.1471         0.1773
  1.00     0.2053   0.1585      0.1572           0.1208           0.1471         0.1773

1. MOL = single-pass clusters with cosine correlation, cutoff 0.2, middle document order, overlapping cluster type.
2. MDJ = same as above, except cluster type disjoint.
* These average precisions are taken over the three document orders.

Recall-Precision Tables
Table 1

[Graph 1: Recall-precision curves for the full search, Dattola's cluster search, the single-pass disjoint cluster search, and the single-pass overlapping cluster search; the single-pass curves are averages over the three document orders.]

[Graph 2: Recall-precision curves for Dattola's cluster search using single-pass disjoint initial clusters (cosine, 0.2 cutoff, middle document order), using single-pass overlapping initial clusters (cosine, 0.2 cutoff), and not using single-pass clusters.]

[Graph 3: Recall-precision curves for the single-pass disjoint clusters (middle document order, cosine correlation, cutoff 0.2) and for Dattola's disjoint clusters formed using those clusters as initial clusters.]

[Graph 4: Recall-precision curves for the single-pass overlapping clusters (middle document order, cosine correlation, cutoff 0.20) and for Dattola's overlapping clusters formed using those clusters as initial clusters.]

[Table 2: Global comparison of all evaluation runs, tabulating normalized precision, normalized recall, rank recall, log precision, rank recall plus log precision, and recall ceiling; the entries are not recoverable from the scan.]