XIII. Experiments with a Fast Algorithm for Automatic Classification

R. T. Dattola

Abstract

An algorithm for automatically classifying items in a collection is described, where the time necessary is proportional to N·p·log_p m, where N is the number of items in the collection, m is the final number of clusters desired, and p is the number of clusters produced at each level of the algorithm. Clusters are produced for two different document collections, and the results are evaluated by performing centroid searches on the clustered collection. Finally, a new evaluation measure is defined, and comparisons are made with clusters produced by other automatic methods.

1. Introduction

In order to use real-time automatic document retrieval systems on a very large data base, it is necessary to classify the documents so as to avoid searching the entire collection. However, the classification of a document collection containing more than a few thousand items is not feasible with most procedures for automatic classification. A method is needed which can classify hundreds of thousands of items into useful clusters in a reasonable amount of time.

The method presented in this study is an outgrowth of earlier attempts at fast procedures for automatic classification [1,2]. The algorithm is first presented in a general form, and a proof is given which describes how the time needed to classify a given collection is related to the number of documents in the collection. Then the algorithm as implemented is described in detail, and experiments in classification and evaluation are discussed.

2. General Description

Consider a document collection consisting of N documents with an average of r concepts per document. Assume that the document collection is partitioned into p equal sized clusters, where S_j is the set of documents in cluster j. Associated with each set S_j is a corresponding profile vector P_j, consisting of the rank values of all the concepts from the documents in S_j.
The rank value of a concept is equal to a constant (the base value) minus the rank of the concept, where concepts are ranked in decreasing order of the number of documents in the cluster in which they occur.

A cycle of the algorithm is defined as follows:

a) each document d_i in the collection is scored against each of the p profiles by a scoring function g(d_i, P_j);

b) let H_i = max g(d_i, P_j) over 1 ≤ j ≤ p, i.e., H_i is the highest score attained by d_i against any profile;

and define

K_i = H_i − a·(H_i − K), if H_i ≥ K,
K_i = K, otherwise,

where 0 < a < 1 and K is a specified cut-off score;

c) then define new clusters

S'_j = {d_i | g(d_i, P_j) ≥ K_i};

all documents d_i such that H_i < K fall into no cluster and are called loose documents.

The time for one level of the algorithm is proportional to N·p, since each of the N documents is scored against p profiles, and log_p m levels are needed to produce m clusters. The total time is therefore

T = k·N·p·log_p m = k·N·p·ln m / ln p.

Differentiating with respect to p and setting dT/dp = 0 yields ln p = 1, i.e., p/e = 1. Therefore T is minimized for p = e. Rounding off to the nearest integer gives p = 3.

Although the computation time is minimized for p = 3, in practice it might not be useful to form only three clusters at every level. However, since T has only one minimum point, the time increases monotonically as p increases.

The algorithm as described can be used to generate multi-level clusters. However, it is generally more effective not to equate one level of the algorithm with one level of a multi-level classification scheme. For example, suppose a set of 10^6 documents is to be classified into two levels, with 100 centroids on the first level and 10,000 centroids on the second level. This can be done directly with two levels of the algorithm, where p = 100. In this case

T = k·10^6·10^2·log_100 10^4 = k·10^6·10^2·2 = 2k·10^8.

However, the specifications can also be satisfied by using four levels of the algorithm instead of only two, where the results of the second and fourth levels are used as the two levels of clusters. In this case p = 10, and

T = k·10^6·10·log_10 10^4 = k·10^6·10·4 = 4k·10^7.

3. Implementation

In this section the algorithm is described in detail exactly as programmed. The major change from the general description is that only one level of the algorithm has been implemented. Thus, in all the experiments, p = m and

T = k·N·m·log_m m = k·N·m.

In addition, this section describes several alternate methods for generating the initial clusters that are necessary to start the algorithm, and a formal definition of overlap is given. The algorithm is designed to control three basic parameters: number of clusters, amount of overlap, and size of clusters.
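Before turning to these parameters, the cycle defined in section 2 can be sketched in a few lines. This is an illustration only, not the actual program: the scoring function g is simplified to a sum of profile rank values over the document's concepts, and all names are ours.

```python
# Illustrative sketch of one cycle of the algorithm (section 2).
# The scoring function g(d_i, P_j) is simplified to the sum of the
# profile's rank values over the document's concepts; names are ours.

def score(doc, profile):
    """g(d_i, P_j): sum of rank values of the document's concepts."""
    return sum(profile.get(concept, 0) for concept in doc)

def one_cycle(docs, profiles, K, a):
    """Score each document against every profile, compute H_i and the
    cutoff K_i = H_i - a*(H_i - K), and form the new clusters S'_j."""
    clusters = [[] for _ in profiles]
    loose = []
    for i, doc in enumerate(docs):
        scores = [score(doc, p) for p in profiles]
        H = max(scores)
        if H < K:                 # document scores below K everywhere:
            loose.append(i)       # it remains loose this cycle
            continue
        K_i = H - a * (H - K)     # 0 < a < 1; a larger a lowers the
        for j, s in enumerate(scores):   # cutoff, letting a document
            if s >= K_i:                 # enter more than one cluster
                clusters[j].append(i)
    return clusters, loose

docs = [{"c1", "c2"}, {"c2", "c3"}, {"c4"}]
profiles = [{"c1": 3, "c2": 2}, {"c3": 3, "c2": 1}]
clusters, loose = one_cycle(docs, profiles, K=2, a=0.5)
# here document 2 matches no profile and stays loose
```

With a = 0.5 each document's individual cutoff K_i lies halfway between its best score H_i and the global cutoff K, so only profiles scoring close to the best one capture the document; raising a relaxes this and increases overlap.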
The number of clusters and amount of overlap are input parameters, while the size of clusters is internally constrained to vary no more than from one-half to twice the average cluster size. In addition to these, several other parameters are controlled, as explained throughout this section.

A) Initial Clusters

In order to execute the classification algorithm, it is necessary to designate initial clusters. Four different methods of initial cluster generation are implemented; they are referred to as the correlation, similarity correlation, frequency, and random methods. All of the methods involve choosing one document per cluster to be used as a seed to start the algorithm.

The correlation method uses the cosine correlation to locate documents that are highly correlated with many other documents:

1) Randomly select √N documents from the collection, but exclude those documents that have already been defined as cluster seeds.

2) Compute the global average cosine correlation (GAVG) and standard deviation (STD) for the sample:

a) start with the first document in the sample and compute the cosine correlation between this document and all other documents in the sample;

b) calculate the mean value (MEAN) of the correlations computed in step a;

c) repeat steps a and b for all documents in the sample;

d) define GAVG as the mean value of all the values computed in step b;

e) calculate STD as the standard deviation of the values averaged in step d.

3) Use the chosen documents as cluster seeds if MEAN > GAVG + STD. Assuming a normal distribution, approximately 16% of the documents in the sample are used as seeds.

4) Repeat steps 1-3 until the number of seeds = m. Use each seed as an initial cluster.

The similarity correlation method is identical to the previous method except that step 3 is changed as follows:

3) Use the chosen documents as cluster seeds if MEAN > GAVG + STD and if the document correlates below a specified cutoff with each of the documents already defined as seeds.
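As an illustration, steps 1-3 of the correlation method might look as follows. This is a sketch under our own naming: documents are dicts of concept weights, and the cosine, MEAN, GAVG, and STD quantities are as defined above.

```python
import math
import random

# Sketch of steps 1-3 of the correlation method for seed selection.
# Documents are dicts of concept weights; function names are ours.

def cosine(u, v):
    """Cosine correlation between two weighted concept vectors."""
    num = sum(w * v.get(c, 0.0) for c, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * \
          math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def pick_seeds(docs, exclude):
    """One pass of steps 1-3: sample about sqrt(N) documents and keep
    those whose mean correlation with the sample exceeds GAVG + STD."""
    pool = [i for i in range(len(docs)) if i not in exclude]
    k = min(len(pool), max(2, math.isqrt(len(docs))))
    sample = random.sample(pool, k)
    mean = {}                              # MEAN for each sampled document
    for i in sample:
        corrs = [cosine(docs[i], docs[j]) for j in sample if j != i]
        mean[i] = sum(corrs) / len(corrs)
    gavg = sum(mean.values()) / len(mean)  # GAVG over the sample
    std = math.sqrt(sum((m - gavg) ** 2 for m in mean.values()) / len(mean))
    return [i for i, m in mean.items() if m > gavg + std]
```

Repeating this until m seeds accumulate gives step 4; adding a correlation cutoff against the seeds already chosen gives the similarity correlation variant.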
This additional condition prevents two very similar documents from being used as seeds for different clusters.

The frequency method does not use random samples or correlation coefficients, but directly locates documents which have many concepts in common with other documents:

1) Calculate the frequency score (FS) for every document in the collection:

a) for every concept in the document calculate its frequency f, where f = the number of documents in the collection in which the concept occurs;

b) FS = the sum of the frequencies of all the concepts divided by the number of concepts in the document.

2) Rank the documents in order of decreasing frequency score, and pick the top m documents as cluster seeds.

Finally, the random method simply picks out m random documents from the collection to be used as cluster seeds.

B) Overlap

The coefficient used to measure the overlap between clusters is an m-dimensional extension of Tanimoto's two-dimensional correlation coefficient. The most obvious way to measure overlap is simply to compute the average number of clusters per document. When this number is 1, there is no overlap between clusters. However, the main problem with this measure is that its upper bound depends on the number of clusters; e.g., for ten clusters the upper bound is 10, and for one hundred clusters the upper bound is 100. Of course, the upper bound is reached when every document occurs in every cluster, i.e., when each cluster is equivalent to the entire collection.

The generalized form of Tanimoto's coefficient has a lower bound of 0 and an upper bound of 1 independent of the number of clusters. Let S_i be a binary vector of length N specifying the documents in cluster i; i.e.,

S_i(j) = 1 if document j is in cluster i, and 0 otherwise.

Then #S_i = Σ_k S_i(k), i.e., the number of documents in cluster i. Let I_ij = S_i ∩ S_j; then #I_ij = the number of documents occurring in both cluster i and cluster j. Tanimoto's two-dimensional coefficient is:

T = #I_ij / (#S_i + #S_j − #I_ij).

Extending this to m dimensions yields:

φ = [ Σ_{i=1..m−1} Σ_{j=i+1..m} #I_ij ] / [ (m−1)·Σ_{i=1..m} #S_i − Σ_{i=1..m−1} Σ_{j=i+1..m} #I_ij ].

From now on, the numerator of φ will be referred to as NUM.

Theorem: φ has a minimum value of 0 and a maximum value of 1; furthermore, it attains these bounds if and only if there is no overlap, or all clusters are identical, respectively.

Proof:

a) Clearly φ is 0 when S_i ∩ S_j = ∅ for all i ≠ j, since every #I_ij is then 0. On the other hand, φ could only be negative or undefined if (m−1)·Σ #S_i ≤ NUM. This is impossible, since

(m−1)·Σ_{i=1..m} #S_i = Σ_{i=1..m−1} Σ_{j=i+1..m} (#S_i + #S_j) > Σ_{i=1..m−1} Σ_{j=i+1..m} #I_ij = NUM,

because #I_ij ≤ #S_i and #I_ij ≤ #S_j for every pair, and the clusters are nonempty.

b) Let S_1 = S_2 = ... = S_m. Then I_ij = S_1 for all i and j, so #I_ij = #S_i = #S_1. Therefore

NUM = [(m−1) + (m−2) + ... + 1]·#S_1 = [m(m−1)/2]·#S_1,

and

φ = NUM / [(m−1)·m·#S_1 − NUM] = [m(m−1)/2]·#S_1 / {m(m−1)·#S_1 − [m(m−1)/2]·#S_1} = 1.

Conversely, φ could only exceed 1 if NUM were greater than the denominator, i.e., if 2·NUM > (m−1)·Σ #S_i. But

2·NUM ≤ Σ_{i=1..m−1} Σ_{j=i+1..m} (#S_i + #S_j) = (m−1)·Σ_{i=1..m} #S_i,

since #I_ij ≤ #S_i and #I_ij ≤ #S_j. Thus φ ≤ 1, with equality only when #I_ij = #S_i = #S_j for every pair, i.e., when all clusters are identical.

C) Algorithm

After the initial clusters are generated, there exist m clusters with one document in each cluster. The cutoff point for loose documents (K) is now set so that at the end of the first cycle each cluster will average five documents. At the end of each iteration, K is reset so that
x percent of the loose documents will be clustered at the end of the next cycle, where x is an input parameter. The iterations continue until less than y percent of the documents remain loose, where y is also an input parameter. Finally, an option is provided to allow the loose documents to be assigned to the cluster against which they score highest (blending), or to simply group them all together into an additional cluster.

So far nothing has been said about the choice of the base value that is used in calculating the rank values of concepts in profiles. As discussed elsewhere [1], the base value should be set a little above the average cluster size at the beginning of each iteration. The purpose of this is twofold: first, to prevent the rank values from dropping below 1, and second, to allow documents to move freely between clusters. If the base value is too low, many rank values drop below 1 and all are reset to 1, even though the concepts have different frequencies. Fig. 1 illustrates a case where the base value is too low. For the present experiments the base value is set to twice the average cluster size at the beginning of each iteration, but experiments have shown that this is probably too high: even in the largest clusters, the lowest rank values do not come close to 1.

Another parameter which is controlled is the size of the clusters. This is accomplished by deleting those clusters whose size falls below one-half the average, and by breaking up those clusters whose size exceeds twice the average. These checks are made at the end of every cycle of the algorithm. Clusters which get too large are broken up into two non-overlapping clusters by the following algorithm:

1) Select a random sample from among the documents in the cluster and generate seeds as in steps 1-3 of the correlation method for initial cluster generation.
2) Pick the two documents from among the seeds that correlate lowest with one another, and use these two documents as cluster centers for the two new clusters.

3) Assign the documents in the original cluster to the cluster center with which they correlate highest.

Consider step b in the general description of the algorithm in section 2. The parameter a is used to control the overlap, where 0 < a < 1. However, a does not correspond to the actual overlap between clusters as computed by φ.

[Fig. 1: Example of Low Base Value — frequencies, ranks, and rank values for concepts c1 through c10 in a profile with base value 5; the rank values of the lower-ranked concepts drop below 1 and are all reset to 1.]

After every cycle of the algorithm, the overlap is computed and compared with the input parameter OVER, which is the requested amount of overlap. The parameter a is then adjusted so that the value of φ approaches OVER. However, when loose documents are blended into the nearest cluster at the end of the algorithm, the amount of overlap between clusters decreases, since each loose document is assigned to only one cluster. But as the percent of loose documents at the end of the algorithm is an input parameter, the effect of blending on the overlap can be predicted beforehand. Instead of adjusting a so that φ approaches OVER, a is adjusted so that φ approaches a value XOVER chosen so that φ equals OVER after blending.

The change in φ after blending occurs only in the term

(m−1)·Σ_{i=1..m} #S_i,

since some clusters increase in size. After blending this term becomes

(m−1)·(Σ_{i=1..m} #S_i + y·N),

where y is the fraction of documents loose at the end of the algorithm. After blending, φ should equal OVER, where

OVER = NUM / [(m−1)·(Σ #S_i + y·N) − NUM].

Before blending,

XOVER = NUM / [(m−1)·Σ #S_i − NUM].

XOVER must be expressed in terms of OVER. Rewriting OVER,

OVER = XOVER·[(m−1)·Σ #S_i − NUM] / [(m−1)·Σ #S_i − NUM + (m−1)·y·N].

Solving for XOVER,

XOVER = OVER·[(m−1)·Σ #S_i − NUM + (m−1)·y·N] / [(m−1)·Σ #S_i − NUM]

XOVER = OVER·[1 + (m−1)·y·N / ((m−1)·Σ #S_i − NUM)].

After every cycle, XOVER is reset by the above formula and a is adjusted as follows:

1) if φ < XOVER, set a = a + (1−a)·(XOVER − φ);

2) if φ > XOVER, set a = a − a·(φ − XOVER).

The quantity (XOVER − φ) is the difference between the required overlap and the actual overlap, and (1−a) or a represents the maximum amount that a can be changed in the proper direction.

One final input parameter controls the centroid definition. Cluster centroids are defined as the profile concepts with their corresponding rank values as weights. An input parameter z is provided which determines what percent of the concepts in the profile are used to define the centroid. The concepts in each profile are sorted in decreasing order by rank value, and the top z percent of the concepts are used to define the centroid. Normally more than z percent are actually used, since the last concept in the top z percent may have several ties, and all ties are also taken.

4. Evaluation

The final evaluation of the clusters can only be made by performing centroid searches on the classified collection. Two separate types of evaluation are possible: internal and external. The internal evaluation attempts to determine how variations in the parameters of the classification algorithm affect the search results. The external evaluation compares the retrieval results from clusters produced by this algorithm with other algorithms and with a full search.

A) Evaluation Measures

Standard recall-precision curves that are used to evaluate full searches are not satisfactory for centroid search evaluation. The problem is that recall-precision curves do not take into consideration the amount of work which must be done by the retrieval system. Since the main advantage in centroid searching is a reduction in the number of query-document correlations, it is important to include the amount of work performed in any evaluation.
The total number of correlations in a centroid search is equal to the number of query-centroid correlations plus the number of query-document correlations. Dividing this number by the total number of correlations necessary for a full search (N) yields a fraction which compares the amount of work in a centroid search to that of a full search. This fraction is referred to as the correlation percentage (C.P.) [3]. One way of presenting the results is to include the correlation percentage along with the standard recall-precision measures. However, experiments have shown that the standard recall-precision measures improve as the C.P. increases, even when comparing the results from different sets of clusters. It is difficult to decide whether a cluster set that produces a normalized recall of .75 and a normalized precision of .60 with a C.P. of .20 is better or worse than another set of clusters producing a normalized recall of .70 and a normalized precision of .55 with a C.P. of .15. The C.P. can be controlled to some extent during the centroid search, but quite often it varies by ten percent or more from the desired value. This is due to the difference in sizes of the clusters, and cannot be avoided as long as all the documents in the selected clusters are correlated against the query.

A method is proposed which modifies the recall-precision curve by taking into account the value of the correlation percentage. Consider a query Q which has two relevant documents, R1 and R2, and assume that the total number of documents in the collection is 20. Fig. 2a shows the results of a possible centroid search where one cluster containing five documents is selected. Fig. 2b shows similar results for a different set of clusters where one cluster containing ten documents is searched. In both cases the two relevant documents are retrieved at the same ranks, so the recall-precision figures are identical.* However, if there are four clusters in both cases, then C.P. = 9/20 for case (a), while C.P. = 14/20 for case (b). Thus, the evaluation measure should rate case (a) better than case (b), since the same retrieval results are obtained with less work.

Consider now a method in which the precision is not held constant after a recall of 1, but instead is allowed to drop until the rank is equal to the total number of query-centroid plus query-document correlations that have been made. Fig. 3 illustrates the recall-precision results for the previous example using this new method of evaluation.

*In the present SMART evaluation system, the precision is held constant after all the relevant documents are retrieved.

[Fig. 2: recall and precision by rank under the standard evaluation measure, for a) five documents compared and b) ten documents compared.]

[Fig. 3: recall and precision by rank under the new evaluation measure, for the same two cases.]

Another problem occurs when some of the relevant documents are not retrieved. Suppose, as in the previous example, that there are two relevant documents but only one of them is retrieved. Fig. 4 illustrates these results, where both sets of clusters retrieve the relevant document at the same rank.** Once again the recall-precision results are identical, even though case (a) proved as effective as (b) at less cost.

Instead of assigning all the unretrieved relevant documents to the lowest ranks, they are distributed uniformly throughout the ranks greater than the total number of correlations performed. The first unretrieved relevant document is always assigned the middle rank, and the others are spaced uniformly above and below it. Fig. 5 illustrates this assignment for the sample case. Since case (a) did less work, the unretrieved relevant document is assigned a higher rank than in case (b). A graph which plots recall vs. precision at every rank (document level) using the modified results reflects the superiority of case (a), as illustrated in Fig. 6. A graph which plots recall vs. precision at selected recall points (recall level) is not used in the evaluation, because only the highest precision for a given recall appears in such a graph. Notice that in cases where all the relevant documents have not been retrieved, the precision is held constant after the assigned rank of the last relevant document, since the C.P. is already taken into account during the assignment of the unretrieved relevant documents.

**Relevant documents not retrieved are assigned to the lowest possible ranks in SMART.

[Fig. 4: recall and precision by rank under the standard evaluation measure when one of two relevant documents is not retrieved, for a) five documents compared and b) ten documents compared.]

[Fig. 5: recall and precision by rank under the new evaluation measure for the same case; the unretrieved relevant document is assigned a middle rank.]

[Fig. 6: document level graph for the example of Fig. 5.]

In order to illustrate the effects of the new evaluation process on an actual collection, a set of ADI clusters called BASE is considered using both evaluation measures. Four centroid searches are performed on the base set of clusters: BASE2, BASE3, BASE4, and BASE5. BASE2 searches the top two centroids, BASE3 searches the top three centroids, etc. Fig. 7 shows the search results using the standard recall-precision measure, and Fig. 8 shows the results using the new evaluation measure. Also included in each graph are the results of a full search.

For the highest ranked documents, both curves are almost identical, since the new adjustments usually do not affect the high ranks. However, at the lower ranks the new adjustments favor the searches with the lower correlation percentage. The high recall end of Fig. 7 clearly illustrates the superiority of those searches with a high C.P. This is to be expected, since more relevant documents are retrieved with higher C.P. searches. In Fig. 8 the searches with the highest correlation percentages are still better at the high recall end, but the difference is much smaller than in Fig. 7.
In fact, there is very little difference between BASE4, BASE5, and FULL, and between BASE2 and BASE3. Also, the difference between the BASE2 curve and the FULL curve is much smaller.

When comparing the search results obtained from different sets of clusters, an attempt is made to keep the C.P.'s as equal as possible, even though the new evaluation measure adjusts for differences. However, when the C.P.'s differ by as much as .15 (BASE4 compared to BASE3), the graph of the higher C.P. search can be expected to be higher.

[Fig. 7: ADI searches BASE2 (C.P. = .43), BASE3 (C.P. = .56), BASE4 (C.P. = .70), BASE5 (C.P. = .81), and FULL, evaluated with the standard measure.]

[Fig. 8: the same searches evaluated with the new measure.]

B) Internal Evaluation

The purpose of the internal evaluation is to investigate how changes in the input parameters affect the multi-level search results. All the experiments are conducted on the 82 document ADI collection and the 200 document Cranfield thesaurus collection. The following parameters are investigated:

1) initial cluster generation
   a) correlation
   b) similarity correlation
   c) frequency
   d) random
   e) results from a one-pass algorithm; [5]
2) number of clusters;
3) overlap;
4) cutoff, i.e., percent of documents allowed to be loose at the end of the classification;
5) percent of loose documents clustered on the next iteration.

In addition to evaluating the retrieval results, the internal evaluation should determine how close the classification comes to satisfying the input parameters. However, it is probably better not to "force" a classification algorithm to satisfy input parameters such as the number of clusters and amount of overlap exactly. The algorithm should allow some variation in these parameters depending on the collection used. For example, suppose a collection consists of repetitions of three distinct documents. Then clearly three clusters should be produced, no matter how many are requested. On the other hand, it is difficult to obtain experimental conclusions if the parameters are not controlled. Therefore, the algorithm is designed to keep the parameters within fairly small boundaries; in practice it might be better to relax some of these restrictions somewhat.

Experiments are performed by varying one of the five parameters and comparing the results with those obtained with a base classification [4].
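Before turning to the individual parameters, the new evaluation measure used in all of these graphs can be sketched in a few lines. The function name is ours, and the middle-rank assignment for unretrieved relevant documents is omitted for brevity.

```python
# Sketch of the modified document-level evaluation measure (section 4A).
# Precision is not held constant after full recall; it keeps falling
# until the rank equals the total number of correlations performed.
# The middle-rank assignment for unretrieved relevant documents is
# omitted; the function name is ours.

def modified_curve(relevant_ranks, total_correlations):
    """Return (rank, recall, precision) triples for ranks 1..total."""
    n_rel = len(relevant_ranks)
    found = 0
    curve = []
    for rank in range(1, total_correlations + 1):
        if rank in relevant_ranks:
            found += 1
        curve.append((rank, found / n_rel, found / rank))
    # beyond this point precision is held constant, since the C.P.
    # has already been taken into account
    return curve

# Example of Figs. 2-3, case (a): two relevant documents retrieved at
# ranks 1 and 4; five documents plus four centroids -> nine correlations.
curve = modified_curve({1, 4}, 9)
# final point: recall 1.0, precision 2/9, instead of the 0.5 that the
# standard measure would hold constant
```

Running the same query under case (b) with fourteen correlations drives the final precision down to 2/14, so the cheaper search of case (a) scores visibly better, as intended.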
The algorithm should allow some variation For example, suppose Then clearly On the in these parameters depending on the collection used. a collection consists of repetitions of three distinct documents. three clusters should be produced, no matter how many are requested. other hand, it is difficult to obtain experimental conclusions if the parameters are not controlled. Therefore, the algorithm is designed to keep the in practice it might be better parameters within fairly small boundaries. to relax some of these restrictions somewhat. Experiments are performed by varying one of the five parameters and comparing the results with those obtained with a base classification. [4] XIII-28 The parameters for the base classification usually fall somewhere near the middle of the two extremes* Table 1 lists the classification parameters for the ADI collection, and Table 2 illustrates the same parameters for the Cranfield collection. requested (In) The first two columns specify the number of clusters (Out); the next two and the number actually produced columns show the percent overlap requested and the percent produced, and the next two columns specify the percent of documents loose at the end of the algorithm. "Percent loose taken" specifies the percent of loose documents The results of the experiments clustered at the end of each iteration. are shown as document level recall-precision graphs using the new evaluation measure, and the C.P. figures are included when available. C) Initial Clusters The initial cluster evalution includes several sets of cluster generations and centroid searches. experiments are performed: For the ADI collection, the following 1) 25% loose, < .5% overlap (Fig. 9 ) ; _ a) b) c) FREQ Cfrequency), C0RR2 (correlation), FASTD Cinput from one-pass algorithm, 0% overlap); 2) 25% loose, 1-3% overlap (Tig, 10); a) b) c) CORRI (.correlation), RANDOM (random), FASTO Cinput from one-pass algorithm, overlapping clusters); 3) 40% loose, 0% overlap CFig. 
1 1 ) ; a) b) c) OVRO Ccorrelation), OVEROSIM (similarity correlation) OVEROFREQ (frequency). XIII-29 Run ! No. of CIus ters In Out 10 5 15 10 10 10 10 10 10 10 10 10 10 10 10 10 9 9 9 9 . 9 10 9 9 9 4 15 9 9 8 10 9 9 9 9 9 10 10 9 9 8 7 8 9 11 7 9 11 11 14 Percent Overlap In Out 5 5 5 0 10 15 5 5 5 5 5 5 5 0 5 0 0 0 0 0 .3 13.3 .1 .2 .2 .2 .4 .3 1.2 .4 Cutoff ( Loose) % In Out 10 10 10 10 10 10 0 20 30 10 10 10 10 10 10 10 12 13 4 12 12 12 2 23 23 12 3 6 12 12 11 11 7 7 8 11 4 Percent Loose Taken 40 40 40 40 40 40 40 40 40 20 60 80 40 Approx. Time (min.) 1.1 1.0 .8 .7 .8 1.3 1.0 No. of Iter No. of Cycles BASE CLUST5 CLUST15 OVRO OVR10 OVR15 CUTO CUT20 CUT30 LSE20 LSE60 LSE80 BASEFREQ OVROFREQ BASES1M OVROSIM FREQ CORR2 CORR1 SIM RANDOM ROCCHIO BEST FASTO FASTD MDJ i . . . . . • • • 3 5 1 3 3 4 5 2 2 5 3 2 3 2 4 4 7 6 6 6 4 1 6 12 4 5 7 12 9 5 5 9 6 5 6 6 9 9 17 17 13 14 15 6 13 7 J .7 1.0 .8 .8 1.0 1.0 .9 .9 1.7 1.3 1.1 1.4 1.4 .7 1.4 1.0 4a 40 40 25 25 25 25 25 - o 5 5 5 5 5 a .5 ,2 1.1 1.0 2.2 1.5 5 5 5 10. 5 10 5 5 3 3 3 2 5 0 0 3.3 0 0 3 8 5 60 25 25 ADI Clusters Table 1 XIII-30 - • • ' - - . r • i , ~ • • - • • • • • • Run No. of Clus ters In Out 15 10 20 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 12 9 20 13 14 14 13 13 14 13 13 15 13 12 13 15 18 10 13 18 15 15 Percent Overlap In Out 5 5 5 0 10 15 5 5 5 5 5 5 5 0 5 0 5 5 5 5.4 6.1 1.7 0 15.1 15.3 5.2 5.4 1.1 4.1 1.5 1.3 3,8 0 5,7 0 7,1 3.5 8.8 2,9 2 ,5 3.2 Cutoff (%Loose) In 10 10 10 10 10 100 20 30. 10 Id 10 10 10 10 10 10 5 5 10 — Out 13 13 14 9 8 8 3 15 27 11 11 a 9 10 12 13 13 8 4 Percent Loose Taken 40 40 40 40 4040 40 40 40 20 60 80 40 40 40 40 40 5050 - Approx. Time (min.) 6.2 2.7 3.9 2.5 5.6 6.0 6.6 4.5 2.4 No. of Iter No. 
of Cycles BASE CLUST10 CLUST20 OVRO OVR10 OVR15 CUTO CUT20 CUT30 LSE20 LSE6Q LSE80 BASEFREQ OVROFREQ BASESIM OVROSIM RANDOM CORR FREQ ROCCHIO BEST RANCLUST 6 3 3 5 4 4 4 4 3 9 3 2 5 4 4 3 3 4 4 3 — 47 26 20 17 31 35 37 29 17 69 14 10 24 13 30 13 30 25 35 23 - aa 2,3 2,0 3.5 2,4 4.4 2.8 6.4 ^2 6,0 3.6 ^ 15. ^ 60 — Cranfleld Clusters Table 2 XI .0 .9 .8 • o A FREQ C(|)RR2 FASTD •6r •5r •X. •3r A o >c*^i>=. J L =&=E^* 0 .1 .2 .5 .7 .8 .9 1.0 Recal ADI Initial Clusters 2 5 % Loose < . 5 % Overlap Fig. 9 XIII-32 1. u r"~ • o ^ C(|)RRI RANDOM FASTCj) 1 1 1 ^ \ _l * 1 1 _J 1 1 1 l_ 1 1 1 1 —I <-*• A .5 .6 .8 1.0 Recall ADI Initial Clusters 2 5 % Loose, 1 - 3 % Overlap Fig. 10 X I II 1.0 .9 .8 .7 •6| c o £ .5 o the first method is better than the second; the first method is much better than the second. c) >> The results of the three ADI experiments with initial clusters are as follows: 1) 2) 3) CFREQ = FASTDl > C0RR2 C0RR1 > CRAND0M=FAST0) OVRO = OVROSIM = OVROFREQ. Unfortunately, the results do not show that any method is consistently Detter than any other. The only conclusion is that the frequency, the similarity correlation, and the use of disjoint clusters from a one-pass algorithm never perform worse than any other method, while the random method and overlapping clusters place last in their only test. The following initial cluster experiments are performed with the Cranfield collection: 1) 40% loose, 4-7% overlap (Fig. 12); a). b) BASE (correlation), BASESIM Csimilarity correlation), XIII-35 • BASE (C.P = .26) o BASBSIM (C.P=.26) A BASEFREQ ( . . =.25) CP D RANDOM ( . . =.27) CP Recal Cranfield Initial Clusters 4 0 % Loose, 4 - 7 % Overlap rig. 12 XIII-36 c) d) BASEFREQ (frequency), RANDOM (random). 2) 40% l o o s e , 0% o v e r l a p Grig. 1 3 ) ; a) b) c) OVRO ( c o r r e l a t i o n ) , OVROSIM ( s i m i l a r i t y c o r r e l a t i o n ) , OVROFREQ ( f r e q u e n c y ! . 
The results of these two initial cluster experiments are:

1) (RANDOM = BASEFREQ) > (BASE = BASESIM)
2) OVR0 >> OVR0FREQ > OVR0SIM.

Once again the results are inconclusive. The correlation method performs better than the other two methods in the nonoverlapping case, but does not do as well as the random or frequency methods in the first experiment. Surprisingly, the random method actually performs better than the correlation or similarity correlation methods. This indicates either that the algorithm is insensitive to the type of initial clusters used, or that none of the methods used so far is very good. Additional experiments must be carried out using different methods of initial cluster generation. One method which can be tried involves running the algorithm twice: the first time, the initial clusters are generated using the random method, and the clusters produced by this run are then used as the initial clusters for the second run.

[Fig. 13: Cranfield initial clusters, 40% loose, 0% overlap — OVR0 (C.P. = .23), OVR0SIM (C.P. = .20), OVR0FREQ (C.P. = .24).]

D) Number of Clusters

In the remainder of the experiments, the correlation method of initial cluster generation is used, and unless otherwise specified, all other parameters are identical to those of the BASE run. For any given collection, the number of clusters can be chosen to minimize the search time, assuming that all clusters contain the same number of documents. This assumption is not too invalid, since the cluster sizes are maintained between one-half and twice the average cluster size. Consider a collection of N documents classified into m clusters, where each cluster contains on the average n documents. For a centroid search which looks at the documents in the top q clusters, the search time is

ST = m + q·n.

Assuming the overlap φ = 0, n = N/m and

ST = m + q·N/m.

The search time is minimized for dST/dm = 0, where

dST/dm = 1 − q·N/m².
dST/dm = 0 => m = √(q*N).

For the ADI collection, ST is minimized with m = √82 = 9 clusters if one centroid is searched, and m = √164 = 13 clusters if two centroids are searched. Of course, there is no guarantee that optimizing the search time yields the best results. The following experiments are performed with the ADI collection (Fig. 14):

1) BASE (9 clusters),
2) CLUST5 (4 clusters),
3) CLUST15 (15 clusters).

The parameters for the centroid search allow the number of centroids searched per query to vary for different queries. A specified minimum is always searched for each query, but additional centroids may also be searched if the query-centroid correlation is close enough to the correlations of the centroids already chosen. Thus, the average number of centroids searched is usually higher than the minimum. In experiment 1 a minimum of two centroids is searched, while in experiment 2 one centroid is searched, and in experiment 3 three centroids are searched.

[Fig. 14. ADI Number of Clusters: recall-precision curves for BASE, CLUST5, and CLUST15.]

The minimum number of centroids searched is chosen to keep the search times approximately equal. The approximate search times for each experiment are:

1) ST = 9 + 2*82/9 = 9 + 164/9 = 9 + 18 = 27;
2) ST = 4 + 1*82/4 = 4 + 21 = 25;
3) ST = 15 + 3*82/15 = 15 + 246/15 = 15 + 16 = 31.

The results of the experiment show that BASE > CLUST15 > CLUST5. For the Cranfield collection, the following experiments are made (Fig. 15):

1) BASE (12 clusters),
2) CLUST10 (9 clusters),
3) CLUST20 (20 clusters).

All of these searches are made using a minimum of one centroid, so the approximate search times may be quite different. Fortunately, the actual C.P.'s are available, and they can be compared with the ST figures:

1) ST = 12 + 1*200/12 = 12 + 17 = 29, C.P. = .26 => ST' (actual search time) = .26*200 = 52;
2) ST = 9 + 1*200/9 = 9 + 22 = 31, C.P.
= .34 => ST' = .34*200 = 68;
3) ST = 20 + 1*200/20 = 20 + 10 = 30, C.P. = .25 => ST' = .25*200 = 50.

All the ST values are lower than the actual search times, but this is mainly due to the fact that more than one centroid is often used, and the overlap is not 0.

[Fig. 15. Cranfield Number of Clusters: BASE (C.P. = .26), CLUST10 (C.P. = .34), CLUST20 (C.P. = .25).]

Fig. 15 shows that (BASE = CLUST10) > CLUST20. Unlike the ADI results, the search with the smallest number of clusters does better than the search with the largest number. This is probably due to the larger number of clusters used with the ADI CLUST15. At any rate, both the ADI and Cranfield results indicate that the middle number of clusters does best. Notice that both sets of BASE clusters contain very close to √N clusters (√82 = 9, √200 = 14).

E) Overlap

An inspection of Tables 1 and 2 shows that the amount of overlap produced is very often much different from the amount requested. The requested overlap is varied from 0% to 15% in steps of 5%, but in both collections several of the requests were not satisfied. However, the following comparisons can be made for the ADI experiments (Fig. 16):

1) BASE (.3%),
2) OVRO (0%),
3) OVR15 (13.5%).

The results indicate that BASE > (OVRO = OVR15). Thus, for the ADI collection a small amount of overlap performs better than a large amount of overlap and better than 0 overlap. The following experiments are performed for the Cranfield collection (Fig. 17):

1) BASE (5.4%),
2) OVRO (0%),
3) OVR15 (15.3%).

These results indicate that OVRO >> BASE > OVR15.

[Fig. 16. ADI Overlap: recall-precision curves for BASE, OVRO, and OVR15.]

[Fig. 17. Cranfield Overlap: BASE (C.P. = .26), OVRO (C.P. = .23), OVR15 (C.P. = .43).]

Once again the clusters
with a very large amount of overlap perform poorly, even though the C.P. for OVR15 is about twice as great as the C.P. for BASE and for OVRO. However, a surprising result is the better performance of the clusters with 0% overlap compared to those with 5% overlap. Since the best ADI results are obtained with only .3% overlap, the conclusion is that a very small amount of overlap is helpful, but more than 5% is too much. More experiments need to be run using an overlap between 0-5%. It appears that the best results should be obtained in this range.

F) Cutoff

The value of the cutoff determines how many documents are left loose at the end of the algorithm. In all cases, the loose documents are blended into the nearest cluster. The following experiments are compared for the ADI collection (Fig. 18):

1) BASE (12% loose),
2) CUTO (2.5% loose),
3) CUT20 (23% loose).

The main advantage of a high cutoff is that it allows the algorithm to terminate faster. The cutoff should be chosen as high as possible to shorten the cluster time, without seriously affecting the search results. The ADI results show that (BASE = CUT20) > CUTO. This is surprising since it is expected that the lower cutoff would produce better results.

[Fig. 18. ADI Cutoff: recall-precision curves for BASE, CUTO, and CUT20.]

The following experiments are run on the Cranfield collection (Fig. 19):

1) BASE (13% loose),
2) CUTO (3.0% loose),
3) CUT20 (16% loose),
4) CUT30 (27% loose).

[Fig. 19. Cranfield Cutoff: BASE (C.P. = .26), CUTO (C.P. = .28), CUT20 (C.P. = .36), CUT30 (C.P. = .25).]

Notice that the overlap for CUT30 is only 1.1%, while for the other runs it is 5-6%. The experiments on overlap indicate that 1% is probably better than 5-6%. Thus, CUT30 might do well because of its overlap. The results show that (BASE = CUTO) > CUT30 > CUT20.
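The blending of loose documents into the nearest cluster, mentioned under the cutoff discussion, can be sketched in a few lines. This is only an illustration: the scoring function g (a dot product over shared concepts) and all the data are hypothetical stand-ins for the actual implementation.

```python
def g(doc, profile):
    """Score a document against a cluster profile (dot product over shared concepts)."""
    return sum(w * profile[c] for c, w in doc.items() if c in profile)

def blend_loose(loose_docs, profiles, clusters):
    """Assign every remaining loose document to its highest-scoring cluster."""
    for doc_id, doc in loose_docs.items():
        best = max(profiles, key=lambda j: g(doc, profiles[j]))
        clusters[best].append(doc_id)
    return clusters

# Tiny hypothetical example: two clusters, one loose document.
profiles = {0: {"a": 3, "b": 2}, 1: {"c": 3, "d": 2}}
clusters = {0: ["d1"], 1: ["d2"]}
loose = {"d3": {"c": 1, "d": 1}}
print(blend_loose(loose, profiles, clusters))  # d3 joins cluster 1
```

A higher cutoff leaves more documents for this final blending step, which is why it shortens the clustering time at the possible expense of cluster quality.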
The Cranfield results favor the lower cutoff values, and it is possible that CUT20 would have done better than CUT30 if their overlaps were equal. However, the experiment with 3% loose does not do better than the run with 13% loose. Since the lower cutoffs do not perform as well in the ADI experiments, the best cutoff appears to be 12-13%.

G) Percent Loose Clustered

This parameter controls the percent of the loose documents that are clustered at the beginning of the next iteration. Thus, 40% means that 40% of the remaining loose documents are assigned to clusters after the first cycle of the next iteration. For both the ADI and Cranfield collections the following experiments are performed:

1) BASE (40%),
2) LSE20 (20%),
3) LSE60 (60%),
4) LSE80 (80%).

The ADI results appear in Fig. 20 and the Cranfield results in Fig. 21. For the ADI, (BASE = LSE60 = LSE80) > LSE20, and for the Cranfield, LSE60 >> BASE > LSE80.

[Fig. 20. ADI Percent Loose Clustered: recall-precision curves for BASE, LSE20, LSE60, and LSE80.]

[Fig. 21. Cranfield Percent Loose Clustered: BASE (C.P. = .26), LSE20 (C.P. = .29), LSE60 (C.P. = .29), LSE80 (C.P. = .22).]

In both cases LSE20 is poorest, but LSE60 is much better than the others for the Cranfield results; therefore, 60% loose gives the best results over both collections.

H) External Evaluation

On the basis of the internal evaluation results, the following conclusions are made regarding the "best" input parameters for the classification algorithm:

1) initial clusters — no conclusion (correlation used);
2) number of clusters — approximately √N clusters; e.g., the best results are obtained with 9 clusters for the ADI collection and 13 for the Cranfield collection;
3) overlap — probably somewhere between 1-3% (2% used);
4) cutoff — 10%;
5) percent loose clustered — 60%.

An additional run called BEST is made using these parameters as input.
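The √N rule in conclusion 2) follows from the search-time model ST = m + q*N/m of Section D, and the numbers quoted there can be reproduced with a short sketch (the function names are hypothetical; q is the minimum number of centroids searched):

```python
import math

def optimal_m(N, q):
    """Cluster count minimizing ST = m + q*N/m, i.e. m = sqrt(q*N)."""
    return round(math.sqrt(q * N))

def search_time(N, m, q):
    """Approximate search time: m centroid comparisons plus q*N/m documents."""
    return m + q * N / m

# ADI collection, N = 82 documents:
print(optimal_m(82, 1))   # 9 clusters when one centroid is searched
print(optimal_m(82, 2))   # 13 clusters when two centroids are searched
# Cranfield BASE run, N = 200 documents, m = 12 clusters, q = 1:
print(round(search_time(200, 12, 1)))   # about 29
```

As the text notes, these ST figures are only lower bounds on the actual search time, since extra centroids are often searched and the overlap is not 0.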
Unfortunately, all of the output parameters did not satisfy their input requests; e.g., the overlap is 0% for the ADI and .5% for the Cranfield. However, LSE60 from the Cranfield clusters matches the parameters almost exactly — 13 clusters, 1.5% overlap.

Fig. 22 shows the recall-precision curves for the ADI BEST, OVRO, and FULL. Fig. 23 shows BEST, LSE60, and FULL for the Cranfield collection. Also included in Fig. 23 are the results of a random clustering of the collection. Both ADI BEST and OVRO have 0% overlap, and the results show that FULL > BEST > OVRO. In the Cranfield results, the surprising fact is the poor showing of BEST. However, the run which actually contains the desired output parameters, LSE60, performs better than the full search up to 50% recall. Thus, (FULL = LSE60) >> BEST >> RANCLUST.

[Fig. 22. ADI Full Search Comparison: recall-precision curves for OVRO, BEST, and FULL.]

[Fig. 23. Cranfield Full Search Comparison: LSE60 (C.P. = .23), BEST (C.P. = .20), FULL, RANCLUST (C.P. = .18).]

Comparisons can also be made with cluster results from Rocchio's classification algorithm [6]. Several sets of clusters are available as output from Rocchio's algorithm, and those which yield the best retrieval results are chosen for comparison. These clusters are called ROCCHIO in Table 1 and Table 2. The Rocchio clusters are compared against those runs which most closely match the number of clusters and overlap of ROCCHIO. Thus, SIM is used for the ADI results, and RANDOM is used for the Cranfield results. The search parameters are adjusted so that exactly two centroids are chosen for every query. Fig. 24 and Fig. 25 show that SIM >> ROCCHIO and RANDOM2 >> ROCCHIO. However, another run called NEWROC is made by using Rocchio's clusters with the same centroid definition that is used for SIM and RANDOM2.
The centroids for Rocchio's clusters are defined by using all the concepts and taking the sum of the weights; i.e., if concept i occurs in two documents within the cluster with weights of x1 and x2, then its weight in the centroid is x1 + x2. The centroids for the clusters in this study are defined as the top percent of the profile vector; i.e., the weights of the concepts are equal to their rank values. For the ADI experiments a minimum of 50% of the concepts are used. The rank value differs from Rocchio's weights in two important respects:

1) the weight of a concept within a particular document is ignored in computing the rank value; i.e., all concepts in a document are assigned the same weight;

2) the magnitude of the difference in the number of documents within the cluster in which a concept occurs is ignored; i.e., if concept i is ranked first and occurs in 10 documents, and concept j is ranked second, then it doesn't matter if concept j occurs in 9 documents or only 2 documents; its rank value is still one less than the rank value of concept i.

[Fig. 24. ADI Rocchio Comparison: recall-precision curves for SIM, ROCCHIO, and NEWROC.]

[Fig. 25. Cranfield Rocchio Comparison: RANDOM2 (C.P. = .32), NEWROC (C.P. = .25), ROCCHIO (C.P. = .24).]

Fig. 24 shows that NEWROC > SIM >> ROCCHIO, while Fig. 25 shows RANDOM2 > NEWROC >> ROCCHIO. Thus, in both cases the results improve greatly over the Rocchio clusters simply by changing the centroid definition. However, the present classification algorithm still performs better than NEWROC for the Cranfield results, although NEWROC does better for the ADI results.

The final external evaluation is made by comparison with the results from a one-pass clustering algorithm. A description of the algorithm and the results are presented elsewhere [5]; unfortunately the results are misleading.
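The two centroid definitions contrasted above can be sketched side by side. The three-document cluster below is a hypothetical toy example, and the base value of 10 is arbitrary; rank values follow the definition given earlier (base value minus the concept's rank by within-cluster document frequency).

```python
def rocchio_centroid(docs):
    """Rocchio-style centroid: sum each concept's weights over all documents."""
    centroid = {}
    for doc in docs:
        for concept, weight in doc.items():
            centroid[concept] = centroid.get(concept, 0) + weight
    return centroid

def rank_value_centroid(docs, base=10):
    """Rank-value centroid: rank concepts by the number of cluster documents
    containing them (within-document weights ignored); weight = base - rank."""
    doc_freq = {}
    for doc in docs:
        for concept in doc:                       # presence only, weight ignored
            doc_freq[concept] = doc_freq.get(concept, 0) + 1
    ranked = sorted(doc_freq, key=doc_freq.get, reverse=True)
    return {c: base - rank for rank, c in enumerate(ranked, start=1)}

# Hypothetical cluster of three documents (concept -> weight):
cluster = [{"a": 5, "b": 1}, {"a": 4, "c": 2}, {"a": 1, "b": 3}]
print(rocchio_centroid(cluster))      # {'a': 10, 'b': 4, 'c': 2}
print(rank_value_centroid(cluster))   # {'a': 9, 'b': 8, 'c': 7}
```

Note how the rank-value centroid exhibits both differences listed above: the weights 5, 4, and 1 of concept a contribute nothing beyond presence, and consecutive ranks always differ by exactly one regardless of the gap in document frequency.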
The best set of one-pass clusters is chosen (MDJ — 0% overlap) and plotted against OVRO using the new evaluation measure. Fig. 26 shows that OVRO = MDJ. However, C.P. = 0.50 for OVRO and C.P. = 0.71 for MDJ. Both searches are made using a minimum of two clusters per query. The reason for the very high correlation percentage for the one-pass clusters is due to the size of the clusters. MDJ contains 14 clusters, but the three largest clusters, which are usually chosen in the search, contain 72% of the documents; seven of the clusters contain only one document. Another run is made called MDJ1 where exactly one cluster is chosen per query. Even in this case C.P. is quite high — .46, but now OVRO > MDJ1. It is clear that the one-pass algorithm needs to be modified so that the size of the clusters does not vary so much. Perhaps an additional pass should be made to break up large clusters and to merge smaller clusters.

[Fig. 26. ADI Fast Cluster Comparison: OVRO (C.P. = .50), MDJ (C.P. = .71), MDJ1 (C.P. = .46).]

5. Conclusion

The multi-level classification algorithm presented in this study runs in time T = k*N*p*log_p m, where k is a constant, N is the number of documents in the collection, m is the final number of clusters desired, and p is the number of clusters produced at each level of the algorithm. It is shown that the closer p is to e, the faster the algorithm will operate. The complete multi-level algorithm is not yet implemented, so all the experiments are run with p = m => T = k*N*m. With m = √N, T = k*N^(3/2). This is, of course, much better than classification methods that run in time proportional to N^2, but it is still not satisfactory for very large collections. For these collections it is necessary to implement the entire algorithm, and to run with small values of p. With N = 10^6, it is theoretically possible for T to equal 2k*10^7. With p = 3, if once again
m = √N = 10^3, then T = k*10^6*3*log_3 10^3 = k*10^6*3*6.3 = k*10^6*18.9 ≈ 2k*10^7. This is much better than using only one level, where p = m and T = k*10^6*10^3 = k*10^9.

Fortunately, many of the input parameters which yield the best search results also help to lower the constant of proportionality k. Table 2 shows that LSE60, the best cluster run, took approximately 2.3 minutes for the 200 document Cranfield collection, while LSE20 took over 9 minutes! The high percent of loose clustered and the low amount of overlap both reduce the clustering time.

One of the major problems with the algorithm is the failure to satisfy the requested amount of overlap. This is due to the fact that the parameter a is not changed enough after each cycle. Recall that a is reset after every cycle: if φ < XOVER, a becomes a + (1-a)*(XOVER-φ), and a is adjusted correspondingly if φ > XOVER. The parameter b is set to 1.0 at the beginning of each iteration. After every cycle, b is reset as follows:

b = b*[1 + |XOVER-φ|].

Thus, b ranges between 1.0 and 2.0.

A comparison with Rocchio's algorithm shows that the method used to define centroids is very important. Certainly the rank values prove to be better than the sum of the weights. However, it is not clear whether the rank values are better because they ignore weights within documents, or because they ignore the magnitude of differences in the number of documents in which each concept occurs. This can be decided by summing the weights as Rocchio does, but instead of using this number as the weight, ranking the concepts according to their sum and then calculating the rank values to be used in the centroid definition.

All the evaluations performed in this study are done by visual inspection of the document level recall-precision graphs. This rather inexact method can be improved by using statistical tests such as the sign test and the t-test to compare two curves. Routines are being programmed to perform such tests, and they will be used in the future.
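The running-time arithmetic at the start of this conclusion can be checked with a short sketch (the constant k is dropped; N = 10^6 documents and m = 10^3 final clusters, as in the example above):

```python
import math

def cluster_time(N, p, m):
    """T = N * p * log_p(m), with the constant k dropped."""
    return N * p * math.log(m, p)

N, m = 10**6, 10**3
multi_level = cluster_time(N, 3, m)   # p = 3: about 1.9e7
one_level = cluster_time(N, m, m)     # p = m: log_m(m) = 1, so N*m = 1e9
print(round(multi_level / 10**7, 1))
print(int(one_level))
```

Evaluating cluster_time over integer values of p also confirms the remark that the closer p is to e, the faster the algorithm: p = 3, the integer nearest e, gives a smaller time than either p = 2 or p = 4.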
Finally, all of the conclusions and evaluations are based on results from an 82 document and a 200 document collection, containing 35 and 42 queries respectively. These results should be supplemented by experiments run on the 424 document Cranfield collection containing 155 queries, and eventually on the 1400 document Cranfield collection.

References

[1] R. T. Dattola, A Fast Algorithm for Automatic Classification, Scientific Report No. ISR-14 to the National Science Foundation, Section V, Department of Computer Science, Cornell University, October 1968.

[2] L. B. Doyle, Breaking the Cost Barrier in Automatic Classification, SDC Paper SP-2516, July 1966.

[3] S. L. Worona, Query Clustering in a Large Document Collection, Scientific Report No. ISR-16 to the National Science Foundation, Section XV, Department of Computer Science, Cornell University, September 1969.

[4] K. S. Jones and D. Jackson, The Use of Automatically-Obtained Keyword Classifications for Information Retrieval, Final Report, Cambridge Research Unit, Cambridge, England, February 1969.

[5] V. P. Marathe and S. L. Rieber, A One-Pass Clustering Algorithm, Scientific Report No. ISR-16 to the National Science Foundation, Section XIV, Department of Computer Science, Cornell University, September 1969.

[6] J. J. Rocchio, Document Retrieval Systems — Optimization and Evaluation, Scientific Report No. ISR-10 to the National Science Foundation, Harvard University, March 1966.