V-l

V.

A Fast Algorithm for Automatic Classification R. T. Dattola

Abstract Several different methods exist for classifying the elements of a file into groups based on similarities in the attributes of the elements. In information retrieval, the elements are frequently documents, and the attributes are words or concepts characterizing the documents. Most known procedures are based on the construction of similarity matrices specifying the pairwise similarities between each pair of elements. Such n-square procedures (for n elements) are expensive to carry out in terms of time and memory space. In the present study, a classification

process of order n log n (for n elements) is described, and convergence proofs are given. discussed. Possible applications to information retrieval are

1.

Introduction Many methods exist for ordering or classifying the elements of

a file.

The elements are usually clustered into groups based on the In information retrieval,

similarities of the attributes of the elements.

the elements are frequently documents, and the attributes are words or concepts characterizing the documents. be divided into two basic categories: Classification of document files may

a)

an a priori classification already exists and each document is placed into the cluster whose centroid is most similar to that document;

V-2

b)

no a priori classification is specified and clusters are formed only on the basis of similarities among documents.

Classification schemes that fall into the first class are very common and often involve manual methods. For example, new acquisitions

of a library are classified by placing them into the clusters of a standard, a priori classification. Problems of the second type are usually more

difficult to handle, and automatic or semi-automatic methods are often used. Methods of this type are widely used in statistical programs, but the number of elements in the file is limited to several hundred, or at most, a few thousand items. In information retrieval applications, the number of

elements may approach several hundred thousand or even a million documents, as in the case of a large library. In the present study, a method is

described which is suitable for classification of very large document collections.

2.

The N

2

Problem

Current methods of automatic document classification usually require the calculation of a similarity matrix. This matrix specifies

the correlation, or similarity, between every pair of documents in the 2 collection. Thus, if the collection contains N documents, N computations are required for calculation of the similarity matrix.* This immediately

* Very often, the similarity matrix is symmetric, so the number of computations is reduced to N^/2.

V-3

poses two serious problems:

the storage space necessary to store the

matrix increases as the square of the number of documents, and the time required to calculate the matrix also increases quadratically. Fortunately, document-document similarity matrices are normally only about ten percent dense, and only the non-zero elements need be stored [1] « However, as N increases, auxiliary storage must eventually be used, and although this solves the space problem, it also magnifies the time problem. To illustrate the magnitude of this problem, suppose that it takes one hour of computer time to classify a one thousand document collection. 4 Then for N = 10 , the time is approximately one hundred

hours, and for N • 10 , the time needed is about 120 years! The classification scheme described in this paper is an adaptation of the one proposed by Doyle, and the time required is of the order of N log N [2]. For example, assuming the logrithm has base 10, and the time required for a one thousand document collection is again one hour, 4 6 then for N = 10 the time is 13 hours, and for N = 10 , the time required is about 83 days.

3. Doyle's Algorithm 2 The N problem is avoided in this classification scheme, because a similarity matrix is never computed. Assume the document set is S. is the set of documents

arbitrarily partitioned into m clusters, where in cluster j. Associated with each set C. and frequency vector F,

S. a corresponding concept vector The concept vector consists

are associated.

V-4

of all the concepts occurring in the documents of vector specifies the number of documents in Every concept in C. S,

S,, and the frequency in which each concept occurs*

is assigned a rank according to its frequency;

i.e., concepts with the highest frequency have a rank of 1, concepts with the next highest frequency receive a rank of 2, etc. (base value), every concept in C. Given an integer b

is assigned a rank value equal to the The vector of rank values is

base value minus the rank of that concept. called the profile P. of the set S..

Fig. 1 illustrates the concept and

frequency vectors, and the corresponding profiles for a sample document collection. Starting from a partition of the document set into m clusters, the profiles are generated as described. Every document d, in the document

space is now scored against each of the m profiles by a scoring function g, where g(d.,P.) = the sum of the rank values of all the concepts from which occur in C,. d.

Fig. 2 shows the results of scoring the documents in

the sample collection against the profiles from Fig. 1. Given a cut-off value T, a new partition of the document set into m+1 clusters is made by the following formula:

S\ = <d.|g(d.,P.)> g(d. ,P, ) and g (d. ,P.)> T, for k = 1,. .. ,m > j ^ i1* l j - ' i k i D ~ J Thus, S. '

.

consists of all the documents that score highest against profile In cases where a P
JT_

P., provided that the score is at least as great as T.

document scores highest against two or more profiles, say

,...,P
XT

, n

1 the following tie-breaking rule is used:

V-5

d

l l

d

2 l 2 4 5

d

3 l 7 8

d

4 l 2 3 5

d

5 l 8

d

6 3

d

7 6 8

c

C C C C

C C C

C C C C

C C

C

C C

C

2

I5

a)

Documents

k
d

c
l
c

l

F
3

l

p
5

l

1 S2 C2 F2 P2
2 d4
d C

S

3

C

3

F

3

P

3

d3 dr 5 !

l c2

1 1 1 2

3 3 3 4

l c2

2

5

2 1 1 2

5 4 4 5

6 C3 d 7 C6
C

d

1
1
X 5

M M

cr 5 c? c8

c3 c4 c5

8

b)

Initial Clusters, Profiles, and Frequencies

Construction of Profiles from Document Partition (base value = 6) Fig. 1

V-6

Document
d
d d d d d

Profile of Highest Score 2 2 1 2 1 3 3

Score 15 19 12 19 9 5 10

i
2 3 4 5 6

^

a)

Document Scoring

fl ^ S_i k
d

3

d

l

d

7

d

5

b)

Resulting Clusters

One Iteration of Classification Algorithm (cut-off = 10)

Fig. 2

V-7

i f d. e S . a n d , i D a) b) j = r ,
K

l<k<n,
— —

t h e n d.
1

is assigned to S * ;
r

j ^ r

for ! ^ k ^ n /

k then d. is arbitrarily assigned to S1 .

Those documents which do not fall into any of the m clusters S! are called loose documents, and they are assigned to a special class L. is now repeated after replacing until P, j P. by P'.. The process

The iteration continues P! = P. 3 3

satisfies the termination condition, which states that

for j = l,...,m; i.e., the profiles are unchanged after two consecutive iterations.

4.

Satisfaction of Termination Condition

A)

Non-convergence of Doyle'o Algorithm

Doyle's algorithm as described is not guaranteed to terminate. To illustrate this, consider the following document collection:

d

l

d

2

d

3

d

4

d

5

d

6

d

7

d

8

d

9

1 2 3 4 5 11

1 5 6 11 12

3 7 8 9 12

3 4 7 8

1 5 6 7

2 3 4 9

2 3 4 7 8 9 10

2 4 5 6 7 8 10

1 2 4 5 6 7 10

Let of S , and P value = 7 ) :

S

= /d

- d \

, S

= /d

- d \ , and let P

= profile

= profile of S .

The two profiles are as follows (base

Concept 1 2 3 4 5 6 7 8 9 10 11 12 13

Frequency 3 2 4 3 3 2 3 2 2 0 2 2 0

Profile 5 4 6 5 5 4 5 4 4 4 4

Concept 1 2 3 4 5 6 7 8 9 10 11 12 13

Frequency 1 3 1 3 2 2 3 2 1 3 0 0 1

Now assume that

T = 0, and partition the document set

S i = |di|g(di,P1)>.g(d.,P2)| and s} - |d.| gCd^P^l gCd.^)
The results are summarized in the following table:

g(d

i'pl)
29 22 23 20 19 19 28 27 28

g(d.,P_) : L 2
25 14 19 21 20 20 37 39 42 = < d„ - d.

Therefore,

S' =

and

{

V-9

According to Doyle's algorithm, P profiles are:

is replaced by P' and p

by P'.

The new

d1-d3
Goncept 1 2 3 4 5 6 7 8 9 10 11 12 13

d

4 ~d9 Profile 3 5 4 6 4 4 6 4 3 4 2

Frequency
2 1 2 1 2 1 1 1 1 0 2 2 0

Profile
6 5 6 5 6 5 5 5 5 6 6 -

Concept 1 2 3 4 5 6 7 8 9 10 11 12 13

Frequency 2 4 3 5 3 3 5 3 2 3 0 0 1

Now the document set is again partitioned and the results are:

g(di|Pl) d d2 3 d4 d5 d, d? dQ
O
d

g(d.,P2) 22 11
17

34 29
27

21 22 21 31 31 32

20 17 18 32 33 34

dg

Therefore, S" = ( dn

,.{,

- d^ ) and S' = <d

- d

/.

These are the

original sets, so that the algorithm will never terminate for this example.

V-10

B)

Termination of Modified Algorithm

Although Doyle's algorithm is not guaranteed to terminate, Needham proved that similar types of iterative methods are guaranteed to terminate in a finite number of steps [3]. A small change in Doyle's method produces The modification occurs

an algorithm that is guaranteed to terminate. after the calculation of the S!.

Instead of automatically replacing the

old P. by P'., the following condition must also be satisfied:

ieS',

ieS'.

If the above condition is not satisfied, P, is left unchanged. Before proving that this new algorithm is guaranteed to terminate, it is desirable first to make the algorithm more general by allowing overlap between the clusters. The following theorem proves the termination of a

method which allows overlapping clusters.

Theorem:

Let the subscript n designate the nth iteration.

Let D repre-

sent the document space and let P^ . ,. . . , P_ represent m ^ 0,1 0,m ^ initial profiles corresponding to an arbitrary distribution S^ . , . . . , S^

0,1

0,m

of documents in D.

Given a cut-off value T, the nth iteration is defined as follows: 1. Generate the sets S _,..., S and n,1 n,m L n by

S
L

,-jd, |g(d. ,P

. .) > T )

= / loose documents ^

V-ll

f',i i f E9(di'p-j' *E
2. Let P . j ieS . ies P _ . otherwise n-l, 3

g(d.,p . .) i n-l,]

V
This algorithm is guaranteed to terminate in a finite number of iterations, where termination occurs when P . m P n . for all j. n,D n-l,j Proof: Extend the document space D to a new document space D# containing m distinguishable copies of every document in D. condition that Sn document. Also, add the

, can never contain more than one copy of each /D Clearly, any S . defined on D# in this manner can also
n,j

be represented on D as defined in the theorem. S

Conversely, any

. defined on D can also be represented on D# as defined above.

Thus, it suffices to prove the theorem on D# under the added condition. Define a tunction F , which will be shown to be monotone increasing n in n, by the following:

F

n

=

) F .+ T-Z , where ^/^ n,j n

F

.= n,j

) ) /

g(d. ,P n. .) and g(d.,p i n-lo

ieS n,D Z n = number of documents in L . n is replaced by F', where

After step 2 of the iteration, F

F' . = n,D

E

gCd.,p
i

n, D

, ;

ie.S n,D

V-12

If for any J , P . T± P - ., then F 1 . > F .* and therefore F f > F . y j ' n,: n-1,] n,D n,j n n If termination occurs; i.e., P For the n+lth iteration, . = P . . for all j; then F1 = F . » n,j n-1,3 n n

n+1

n+l,j

E E
j-l

r n

F

, , + T- Z „, where n+1' n+l,D

g(d.,P . ) .
i
n,:

n+1, j Consider the relation between the contribution of d, to F' and F ,, 1 n n+1 and note that each d. (where copies of a document are distinct) contributes once and only once to both F 1 and F 2 n n+1 following table: document d.
1

This relation is summarized in the

relation between contribution of d, to F , and F' i n+1 n

a)

was assigned to S . and now n,D 1. to S _ . (d. did not n+1,j l change clusters)
2

'

to t o SS

n+l,k n+l,k' k ' j

_ ., k ± j (d

i did

> (g(d. ,P
i

.) < T; other-

change clusters)

wise d, would be in S - .. l n+1,j Also, g(d. ,P _) > T) i n, K, > (g(d. ,P
i

n, j

to L

n+1

.) < T.

Now

n, j

d. contributes 1-T = T to

b)

was assigned to L 1. to L n+1

and now = (contributes T to both). > (gave T for F' , and now _

n+1, j

gives g(d., P
I

.) > T for n,u -

Fn+1 ,) This statement is not necessarily true in Doyle's algorithm.

V-13

Therefore, F a) 1 and b)1 F

> F'. n+1 — n

If P

n,j

. = P . . for all Jj , then from n-1,]

_ = F1. n+1 n

Therefore, if the termination condition is On the other hand, if F , = F , then F n+1 n n = F1, n

satisfied, then F , . = F . n+1 n

which occurs only when termination occurs. Thus, F n is a monotone increasing function, where F , = F if the ^ n+1 n Given m and T f F depends only on the

termination condition is satisfied.

distribution of the documents of D# in S_,...,S * Since there are only 1 m a finite number of distributions, there are only a finite number of values for F. Therefore, at some iteration n, F _ must equal F . n+1 ^ n

5.

Implementation The algorithm described in the preceding section is not implemented.

Instead, experiments are performed using an algorithm which differs from the preceding one in four important respects:

1.

the extra condition necessary for convergence that is mentioned in section 4B is not implemented; i.e., P. is always replaced by P';

2.

termination occurs when S* , = S* , for all j , where S* , is n,j n+1,j n,j the subset of S . consisting of all those documents that score n,D highest against profile P .; let H
n,i

3.

, = max. (g(d.,P
* l

.)), and define S
n,j

. as
n,j

S

. = / n,j \

l£j<m d Igld./P n / . ) i ' y v i ' n-l'D I H , a- (H .

. >T . n-l,i - T) > ( , where

, if H

.

. > T, where 0<a<l

'

J T otherwise

V-14

4)

if any S

. contains fewer than 2 documents, then S

, is elimi-

nated, thereby reducing the number of clusters by one.

The advantages of this method over the one defined in the theorem are discussed in the present section? the disadvantage is, of course, that termination is not guaranteed. To show this, note that conditions 1 and 2

above are equivalent to the termination condition in Doyle's algorithm, since in Doyle's method P and S then T . always corresponds to the new partition S Also, if a = 0 in condition 3,
l

.,

. = S* . (no overlap is allowed).
n-l,i

, .= H

n-l,i

, ..

Thus, only those documents d. that score highest 2 Therefore, with

against P

_ ., where H _ . > T, are assigned to S ,. n-i,] n-l,i — n,j

a = 0 this method is equivalent to Doyle's algorithm. The first two modifications are implemented to improve the efficiency of the program. Although convergence is no longer guaranteed, all the Programs without

experiments tried so far have in fact always terminated. these two modifications run about twice as slow.

Also, in cases where the

overlap is not too high (£>* . ~ S . ) , the new termination condition is n,j n,D usually equivalent to the one used in the theorem. That is, when S* . = S* _ . , then very often S , J n+l,D n,j . = S n .. n+1,3

The third modification does not improve efficiency, but it allows a more flexible, and intuitively, a more desirable method for creating overlap. The algorithm described in the theorem assigns a document d. to a cluster Sn . iff g(d,, P _ .) > T. 'J i n-l,j — This has two major disadvantages:

V-15

1)

the overlap cannot be increased independently of the number of loose documents; increasing the overlap by lowering T in general decreases the percentage of loose documents;

2)

the difference between d.'s highest score and d.'s second
1 i

highest score is ignored; e.g., if T = 50, g(d,,P ) = 200, and g(d.,P) = 50, then d.is assigned to both S„ and S^. ^ l 2 i * 1 2 The first problem decreases the flexibility of the algorithm, since the amount of overlap and percentage of loose documents cannot be varied independently. other problem. The example in the second part illustrates the

It seems desirable that a document should be assignable

to two or more clusters when it scores equally (or almost equally) as high against all of them. into account. The previous method does not take this fact

In the new algorithm, documents are assigned to more than

one cluster on the basis of how close the score is to the highest score, relative to the cut-off value T. The parameter a determines how close to When a = 0, no overlap occurs,

the highest score the other scores must be.

while a = 1 generates the maximum amount of overlap.* The last modification increases the efficiency of the program, and also avoids forming clusters around documents which should be classified as loose. When S . contains only one document, and that document is

contained in no other clusters, then it has the same status as a loose document.

Ik

With a = 1/ the formula reduces to T definition of S .as in the theorem.

, = T; hence, it is the same

V-16

6.

Experimental Results The algorithm described in section 5 is used to cluster the 82

document ADI collection and the 200 document Cranfield word stem collection. The results of the classification indicate three important problems:

a)

the scoring function g tends to give higher scores to documents containing a larger number of concepts; thus, many of the documents containing very few concepts are classified as loose;

b)

the documents do not move freely enough from one profile to another; i.e., the final clusters are quite similar to the initial ones;

c)

the initial clusters cannot be chosen arbitrarily.

A)

The Scoring Function

The first problem is due to the fact that g scores a document d, against a profile concepts in concepts than score. d. d P, by simply adding up the rank values of all the If d. contains a larger number of d, to receive a higher

which appear in P..

, the chances are greater for

Fig. 3 is a plot of the score of the document against its final

profile vs. the number of concepts in the document for one of the ADI runs. Although there are a few exceptions, the graph indicates that the documents with a larger number of concepts generally receive higher scores. In

fact, the average number of concepts in a loose document is 11, while the average number of concepts per document for the entire collection is 20. The solution to this problem is to weight the score inversely by the number of concepts in the document. The obvious answer is to divide

V-17

8 r
CO 0> w
C Z3

•

7 k. _ Cluster 3 \ 6 5 4 3
•

/Cluster 1 /
# * !^
c , U 8 t e r 4

y^^—Cluster 2

0)

o o

2il

•

/

J

#

0

1 10

1 20

I 30

1 40

l 50

No. of concepts/document

Illustration of Initial Scoring Function

Fig.

V-18

the score by the number of concepts, but this overcompensates and gives many of the smaller documents the highest scores. Dividing by the square

root of the number of concepts in the document does not solve the original problem; i.e., larger documents give higher scores. Satisfactory results 7/8 .

are obtained when the score is divided by (# of concepts per document) Fig. 4 represents the same ADI sample as Fig. 3, except that the new scoring function h = g (# concepts per document).' is used.

Unlike the

function g, h seems to be independent of the number of concepts in the document.

B)

Movement of Documents

The second problem is clearly indicated by examination of the results of the classification. clusters for the ADI collection. Table 1 shows the initial and final The problem occurs because the documents This problem

tend to "stick" to the clusters that they are already in. is solved by a method similar to that used by Doyle.

During the first few iterations, documents should be allowed to move freely from cluster to cluster, until a nucleus is formed within each cluster. The nucleus consists of those documents that are most Once the nucleus is formed, these Clusters

highly correlated to one another.

documents will probably not move from their present clusters.

can be forced to contain only very highly correlated documents by raising the cut-off value T, assuming that documents with the highest scores are most similar to the other documents in the cluster. investigated later. This assumption is

However, raising the cut-off value results in a This is resolved by repeating the

larger number of loose documents.

V-19

Cluster 3 30 25 20 o > o 15 o <o 10
J.

Cluster 4 Cluster I
• •• .•*!•••

'Cluster 2

0

10

20

30

x

X

X

40

50

No. of concepts/document

Illustration of Modified Scoring Function

Fig. 4

20

•f> •

•

•"••

•' " "•

•

Cluster 1

Initial Documents 1 - 12

Final Documents 1 - 11, 13, 21, 30, 33, 34, 40, 43, 51, 68 3, 10, 13 - 24, 26, 33, 34, 53, 69, 79 9, 11, 13, 20, 22, 23, 25 28, 30 - 34, 36, 47, 51, 55, 65, 75 4, 7 - 9, 14, 20, 30, 37 48, 51, 69 1, 5, 7, 20, 30, 32, 45, 47, 51 - 53, 55 - 59, 79, 80 2, 9, 27, 30, 47, 51, 61, 62, 64 - 71 10, 40, 51, 72 - 75, 7 7 - 8 1 12, 29, 35, 49, 50, 54, 60, 63, 76, 82

2

13 - 24

3

25 - 36

4

37 - 48

5

49 - 60

6

61 - 71

7 Loose

72 - 82 —

Final Results of ADI Classification

Table 1

V-21

classification for a lower value of T, but using the clusters from the first classification as the initial clusters. This creates the problem of how to determine the initial value of T, and how much to decrement it when the classification is repeated using as initial clusters the results of the first classification. The

initial value of T should be high enough so that only those documents which score very highly against profile P. are assigned to S,. One

method of achieving this is to pick T so that the clusters after the first iteration average q documents, where q is small compared to the total number of documents* set at 4. In the experiments run so far, q is arbitrarily

After termination of the first classification, a nucleus is T is now chosen so that a certain percentage

formed within each cluster,

of the loose documents are assigned to clusters after the first iteration of the second classification. Assuming it is desirable to have approxi-

mately x percent of the documents loose after the final clusters are formed, two approaches are possible:

a)

T is lowered far enough so that only x percent of the documents remain loose after the first iteration; thus, after termination of the second classification, the clusters represent the final results;

b)

T is lowered just enough to allow a certain percentage of the loose documents to be assigned to clusters after the first iteration; thus, the classification is repeated until approximately x percent of the documents remain loose.

Experiments performed using both methods indicate that the second approach allows greater control of the loose documents, with only slightly

V-22

greater execution times.

After the first classification/ a large proportion Therefore, if x is not too high, method

of the documents still remain loose. a decreases T by a large amount.

This injects many new documents into the

clusters, and several iterations are necessary before termination occurs. Also, T is chosen so that the percentage of loose documents is x at the end of the first iteration, but it is impossible to know beforehand the percentage of loose documents after the final iteration. In general, the

more iterations, the more the final percent varies from the percent after the first iteration. In method b, T is lowered just enough to allow a

fairly small percentage (20% in the present experiments) of the loose documents to be assigned to clusters. This normally results in only a few

iterations before termination occurs; therefore, the final percent of loose documents does not vary much from the percent loose after the first iteration. The ADI collection is reclassified using the procedures described above, where it is desired that about 25% of the documents remain loose. Once again seven initial clusters are used, and the initial value of T is calculated to be 28.2 so that the clusters after the first iteration average four documents. However, in this case cluster 3 is assigned ten documents, Thus, these three After

while clusters 1,5, and 6 contain only one document.

clusters are eliminated, and the documents within them become loose.

termination occurs, the final clusters are used as initial clusters for the next classification, where T is set to 19.1. The process is repeated again

for T = 16.8, and after termination 17% of the documents remain loose. Table 2 shows the final results of this classification. Compared with Table 1,

many more of the documents have moved from their initial clusters.

V-23

1

—

Cluster 1 2

Initial Documents 1 - 12 13 - 24

Final Documents

3, 5, 9, 10, 14 - 17, 20 28, 30, 34, 37, 43, 45, 48, 53, 57 - 59, 64, 68, 69, 72, 79, 80 1, 2, 5, 6, 21, 24, 27, 41, 43, 47, 58, 61, 62, 79, 80 8, 11, 13, 20, 28, 30, 36, 39, 51, 53, 55, 56, 65 - 68, 70, 71

3

25 - 36

4 5 6 7

37 - 48 49 - 60 61 - 71 72 - 82

7, 31, 42, 44, 46

4, 9, 19, 32, 40, 51, 73 75, 78, 81 12, 18, 29, 35, 38, 49, 50, 52, 54, 60, 63, 76, 77, 82

Loose

—

Final Results of New ADI Classification

Table 2

V-24

C)

Initial Clusters

In the present study, the initial clusters are determined by assigning the first p (or possibly p+1) documents to cluster 1, the next p (p+1) to cluster 2,..., and the final p to cluster m, where p = (total number of documents) / m. Since the nucleus of each cluster depends quite strongly

on the initial clusters, it is not surprising that different initial clusters lead to different results. If the initial clusters are chosen

at random, it is unlikely that the documents within each cluster are very similar. Thus, the nucleus of each cluster might not be very tight.

This problem is solved by insuring that the initial clusters contain at least a few documents that are highly correlated. In the ADI

and Cranfield collections, the order of the documents is such that many adjacent documents are quite similar; therefore, most of the initial clusters contain a few highly correlated documents. In collections where

the order of the documents is random, a simple, fast clustering scheme can be used to determine the initial clusters. This type of an algorithm

need only perform document-document correlations within a fraction of the document space, and therefore, should not take up much time.

D)

Evaluation of Results

The assumption was made earlier that those documents of a cluster S. that score highest against the corresponding profile P, are most

similar to the other docuntents in the cluster.

The phrase "most similar"

is used to mean "correlate most highly", where a standard correlation function is used. Table 3 compares the score of each document to the

average correlation (unweighted cosine function) of each document with

V-25

Cluster 1 Document 25 5 64 23 27 34 15
37

Cluster 2 Document | 8 5 20 68 2 70 39 28 58 36 61 56 66 67 80 43 33 21 11 65 27 71 41 79 24 13 51 53 6 62 55 1 30 Score 18.7 18.8 18.8 18.9 19.2 19.2 19.2 19.3 19.4 19.5 19.6 19.6 19.7 19.9 20.0 20.0 20.0 20.0 20.0 20.0 20.0 20.1 20.2 20.4 20.4 20.5 20.6 20.7 21,0 21.4 21.5 21.7 21.8 Cluster 4 Document 32 51 74 4 75 9 19 73 78 40 81 Score 24.5 25.0 26.0 26.3 26.4 26.5 26.9 27.0 27.1 27.4 27.6 Avg. Corr. .10 .15 .16 .18 .16 .19 .17 .19 .18 .24 .21 Avg. Corr. .09 .12 .12 .10 .12 .13 .14 .11 .12 .14 .11 .14 .12 .12 .14 .14 .15 .15 .14 .15 .13 .14 .15 .16 .15 .15 .12 .16 .18 .18 .17 .20 .18

Score 19.1 19.6 20.1 20.2 20.3 20.6 20.6 20.7 20.8 20.9 21.0 21.0 21.0 21.1 21.2 21.2 21.3 21.4 21.5 21.5 21.6 21.6 21.7 21.7 21.8 21.8 22.0 22.0 22.1 22.2 22.3 22.4 22.4 Cluster 3

Avg. Corr. .08 .12 .10 .13 .10 .11 .11 .14 .12 .12 .12 .14 .14 .12 .14 .13 .15 .13 .13 .15 .16 .15 .14 .16 .15 .17 .17 .17 .17 .17 .17 .18 .17

1

48 58 28 53 20 68 80 57 59 14 16 79 43 24 69 26 17 72 21 3 9 30 22 45 10

|
1

Document 31 46 44 7 42

Score 31.0 31.2 31.3 31.8 33.2

Avg. Corr. .21 .05 .14 .24 .28

Score vs. Average Correlation for ADI Classification Table 3

V-26

every other document in the cluster.

The documents are arranged in

ascending order by scores, and hopefully, the correlations will also appear in ascending order. As the table indicates, there is a strong tendency Table 4

for the higher scores to correspond to the higher correlations.

illustrates the same results for three out of seven final clusters from the Cranfield collection. So far nothing has been said about how to choose the base value (section 3) that is used to compute rank values. important effect on the type of clusters produced. This integer has an Recall that the rank Suppose a

value of a concept equals the base value b, minus its rank. cluster S. contains four documents d

- d , and a total of twenty

different concepts.

The lowest possible rank value for any concept = If b = 20, then the lowest Consider

b - 4, since 4 is the lowest possible rank.

rank value is 16, while if b = 5, the lowest rank value is 1. a document d. . which is the same as d P..

except for one concept, and With b = 20, g(d.,P.) is

assume this concept does not occur in

between 16 and 20 points less than g(d ,P.); with b = 5, g(d.,P.) is only 1 - 5 points less.

Since large clusters have profiles containing many concepts, the chances of a random document d. having concepts in the profile of a d, having concepts in the will score much

large cluster are greater than the chances of profile of a small cluster.

Therefore, if b is high, d.

lower against the profile of the small cluster, and large clusters will tend to capture all the remaining loose documents at the expense of the smaller clusters. Experimental results support this hypothesis, i.e., a

large base value produces a few clusters with many documents, and many clusters with only a few documents.

V

Cluster 1 Document 26 6 7 117 2 121 13 19 60 23 18 44 183 116 128 61 9 197 16 198
3

Cluster 2 Document 3b 97 15 1 34 145 Score 19.9 20.2 20.3 20.3 20.4 20.4 20.8 20.9 20.9 21.1 21.1 21.2 21.3 21.3 21.3 21.6 21.7 21.7 21.8 21.8 21.9 22.0 22.4 22.5 22.5 22.8 22.8 23.1 23.2 23.9 23.4 24.3 25.3 Avg. Corr. .12 .13 .11 .13 .13 .14 .12 .13 .14 .15 .15 .15 .15 .15 .14 .13 .13 .17 .16 .16 .16 .18 .15 .19 .17 .18 .15 .19 .19 .19 .18 .18 .21

Score 22.0 22.3 22.4 23.0 23.1 23.2 23.3 23.4 23.7 23.8 23.8 24.1 24.2 24.3 24.5 24.6 24.6 24.7 24.7 24.7 24.8 24.9 25.0 25.1 25,6 25.6 25.9 26.6

Avg. Corr. .12 .13 .13 .14 .14 .15 .15 .16 .15 .17 .17 .18 .17 .18 .18 .18 .18 .19 .20 .17 .18 .20 .21 .21 .21 .20 .21 .23

i

171

25 28 115 58 181 56 160

172 30 4 140 72 138 143 141 27 36 157 59 156 200 32 137 29 148 57 128 44 31 139 56 160 58

Cluster 3 Document 179 154 79 133 134 77 132 78 76 74 75 Score 31.4 32.5 32.8 32.9 33.4 33.7 33.8 34.1 34.3 34.4 34.5 Avg. Corr. .19 .24 .27 .29 .28 .27 .32 .30 .34 .34 .36

Score vs.. Average Correlation for Cranfield Classification Table 4

V-28

If, on the other hand, b is set so that the lowest rank value in an average cluster is 1, then there is a tendency for small clusters to get larger and large clusters to get smaller. In smaller than average clusters,

all the rank, values are high, since there are only a few different ranks. In larger than average clusters, the rank value as defined might become zero or even less than zero. In these cases, it is redefined to be 1, but then Thus, a document

it is possible for many concepts to have a rank value of 1. often scores higher against the profiles of smaller clusters.

The results of the Cranfield classification clearly indicate the ability of a document to score higher against profiles of smaller clusters. . During the classification, nine clusters are generated, and cluster 9 starts to grow much larger than average (average = 22 documents). keeps growing until it contains 27 documents, and then it starts to oscillate. The following numbers indicate the number of documents in cluster 27, 21, 34, 17, 56, 01 Thus, cluster 9 is It

9 on successive iterations: eliminated.

The same thing happens to cluster 8 on the next few iterations.

Although this tends to keep the size of the clusters somewhat uniform, it is not desirable to throw away a cluster which might contain many highly correlated documents. One solution which might be implemented is to split

up large clusters into several smaller ones; i.e., classify the documents within a single cluster. If the number of documents in the cluster is not 2 algorithm to do this.

too large, it might be practicable to use an N

7.

Conclusion The classification algorithm that has been described in sections 5

and 6 requires the following parameters as input:

V-29

a) b). c)

maximum number of clusters desired; approximate percentage of loose documents desired; decision on whether or not loose documents should be "blended" into the nearest cluster at the end of the classification;

d)

amount of overlap desired.

The first parameter specifies the number of initial clusters that are formed. If no clusters are eliminated during the evaluation, The experiments run so

then the maximum number are actually generated.

far indicate that the number of clusters produced is usually only about 60% of the maximum. The next two parameters determine the "tightness" of the final clusters; the higher the percentage of loose documents/ the tighter the clusters. If no loose documents are desired, parameter b can be set to 0, Almost

but very low percentages increase the running time of the program.

identical results are obtained in less time by specifying about 15% loose, and then asking for all loose documents to be assigned to the cluster to which they score highest. The last parameter determines the amount of overlap. corresponds to a in the formula This number

H
T

n-l,i

_ . - a • (H

n-l,i

. , - T) , if H

n-l,i

_ . > T

n-l,i T otherwise which was mentioned in section 5. When a = 0, no overlap is produced, and The actual percentage

with a = 1, the maximum amount of overlap is produced.

of overlap for a given value of a depends on the collection, but results indicate that 10% overlap for a = .4, and about 20% for a = .6.

V-30

Although the algorithm is not guaranteed to terminate, convergence has always been obtained in practice. In order to prevent the program from

looping in cases of non-convergence, the algorithm can be modified to permit a maximum of n iterations, whether or not convergence is obtained. The results indicate that clusters change very little after about four or five iterations, so that this modification would not make much difference in the final clusters* The true evaluation of the final clusters can only be made by actually performing two-level searches on the clustered document space. However, the algorithm is sufficiently general to allow for the evaluation of many different types of clusters.

V-31

References

[1]

K. S. Jones, D. Jackson, Current Approaches to Classification and Clump - Finding at the Cambridge Language Research Unit, The Computer Journal, Vol. 10, May 1967. L. B. Doyle, Breaking the Cost Barrier in Automatic Classification, SDC paper SP-2516, July 1966. R. M. Needham, The Termination of Certain Iterative Processes, The Rand Corporation memorandum RM-5188-PR, November 1966,

[2] [3]