I. The Cornell Implementation of the SMART System

D. Williamson, R. Williamson, M. Lesk

Abstract

The systems organization of the SMART programs is discussed as implemented for operation in a batch processing mode on the IBM 360/65. Covered in particular are the basic input and text analysis routines, the document clustering programs, the search routines, and the feedback operations. Sample computer output is shown in each case to illustrate the operations.

1. Introduction

The SMART system is designed for the exploration, testing, and measurement of proposed algorithms for document retrieval. The system takes documents and search requests in English, performs a fully automatic content analysis of the texts, matches analyzed documents with analyzed search requests, and retrieves those stored items believed to be most similar to the queries. The request authors (users) can submit information to improve their queries (relevance feedback), and this information is used by several experimental procedures to improve search results. The time required to match large collections of documents to requests can be reduced by grouping these documents (clustering) and matching requests against a representative of the entire group. Finally, exhaustive evaluation procedures can be used to ascertain the effectiveness of various methods used in searching.

Several important criteria are incorporated in the implementation of the SMART system [1]. The requirement for mixing different processing methods, such as clustering, relevance feedback, and searching, implies that the programming system should be written in terms of many small blocks, in such a way that any one process would be synthesized by assembling several blocks into one unit. In this manner, not only can a process be carried out using many different combinations of methods, but a change in any part of the system does not require major alterations of the other parts of the system.
The fast processing speed necessary to process large collections is gained by making it possible to process several queries in parallel.

2. Basic System Organization

The SMART information retrieval process can be divided into five basic sections: the input of printed text, the grouping of documents for searching purposes (clustering), the selection of a group of documents to be searched, the searching of the document group, and the evaluation of the search.

The printed text specifying the queries and documents must be converted into a form more easily handled by a computer. For this purpose various automatic language analysis devices can be used which reduce each query and document to "concept" vector form.

To produce fast searching algorithms, documents can be grouped into classes of similar documents. The grouping (clustering) is done by placing documents containing similar concepts together into the same group; a representative central item is then constructed for each group.

The search of a document group (cluster) is done by first matching requests against clusters. Certain clusters are picked as most likely to contain documents of interest. These documents are then searched in the normal manner, one item at a time. After seeing some retrieved documents, the requestor can modify his request, either by physically changing it, or by using the requestor's relevance assessments to automatically modify the query.

Several measures of retrieval performance are computed to evaluate each search. The sign test, t test, and Wilcoxon rank sum test are also used to determine the significance of the evaluation measurements.

A) Input of Printed Text

The first section involves the reading of text (e.g., abstracts, queries) and the conversion of a given text into numeric concept vectors with weights. The conversion process may involve the use of suitable dictionaries, thesauri, and other language normalization aids. At present, a relatively simple PL/1 program is used to implement this section. A more flexible Fortran IV program is planned for later implementation, as described in report ISR-14 [2] and included in the system flowcharts, part 4 of the report.

The presently available text-handling program, LOOKUP, is a procedure which performs dictionary lookups on a large IBM 360-series computer. It accepts a dictionary, suffix list, and texts, and produces "concept vectors" for the texts. Words missing from the dictionary are also processed. The algorithm is essentially that of Sussenguth [3], although the tree structure storage format is not used. LOOKUP is designed primarily for ease of programming, and is coded entirely in PL/1.

The overall operation of LOOKUP is divided into three parts. First, the dictionary and suffix list are read into memory and sorted alphabetically, and the necessary initialization is performed. Second, text is read in, divided into words, and the words are looked up in the dictionary and suffix lists. Third, the concept numbers derived from the words in each document are sorted and condensed into a properly weighted vector. The vector can be printed and/or stored in machine-readable form.

The lookup program finds a match between an input word and a dictionary entry under the following conditions:

1) the word exactly matches a dictionary entry; or
2) it matches a dictionary entry with a final "e" dropped and a suffix beginning with a vowel added; or
3) it matches a dictionary entry plus a suffix; or
4) it matches a dictionary entry with a final "y" changed to "i" and a suffix added; or
5) it matches a dictionary entry with a final consonant doubled and a suffix added.

When several possible matches are found, the match involving the longest stem is preferred; within stems of the same length, preference is in numerical order as above.
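Sketched in Python (LOOKUP itself is coded in PL/1), the five rules and the longest-stem preference might look like this; the dictionary and suffix list contents are purely illustrative:

```python
# Illustrative sketch of LOOKUP's five matching rules; not the PL/1
# original.  `dictionary` is a set of stems, `suffixes` a suffix list.

VOWELS = set("aeiou")

def find_stem(word, dictionary, suffixes):
    """Return (stem, rule) for the preferred match, or None.

    The longest matching stem is preferred; among stems of equal
    length, the lowest-numbered rule wins.  Stems found through the
    suffix rules must be at least three letters long.
    """
    candidates = []
    if word in dictionary:                         # rule 1: exact match
        candidates.append((word, 1))

    def add(stem, rule):                           # enforce 3-letter stems
        if len(stem) >= 3 and stem in dictionary:
            candidates.append((stem, rule))

    for suf in suffixes:
        if not suf or not word.endswith(suf):
            continue
        base = word[: len(word) - len(suf)]
        if suf[0] in VOWELS:
            add(base + "e", 2)                     # rule 2: final "e" restored
        add(base, 3)                               # rule 3: stem + suffix
        if base.endswith("i"):
            add(base[:-1] + "y", 4)                # rule 4: "i" -> final "y"
        if len(base) >= 2 and base[-1] == base[-2]:
            add(base[:-1], 5)                      # rule 5: undoubled consonant
    if not candidates:
        return None
    return min(candidates, key=lambda c: (-len(c[0]), c[1]))
```

With "cop", "cope", and "copy" in the dictionary, this sketch reproduces the derivations given below for "cops", "copes", "coping", "copies", and "copper".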
Thus, if "cop", "cope", and "copy" are all stems in the dictionary, and all normal English suffixes are included in the suffix list, "cops" is found from "cop" under rule 3; "copes" or "coping" is found from "cope" under rule 2; "copying" from "copy" under rule 3; "copies" from "copy" under rule 4; and "copper" from "cop" under rule 5. Other morphological features of English are not recognized; such word pairs as "mouse" and "mice", "sing" and "sung", "fight" and "fought", or "court-martial" and "courts-martial" must be entered explicitly in the dictionary if both members are to be recognized. Special rules exist which specify that all stems must be at least three letters long (to avoid, for example, finding "wing" from "we" under rule 2 or "inning" from "in" under rule 5); furthermore, all words are truncated at 24 characters.

The program can distinguish titles from the body of the text, if asked; and it may either split the weight of an ambiguous word among its concept numbers, or weight all concept occurrences equally. The suffix list may be omitted from the lookup, in which case only words that exactly match a dictionary entry can be found; and the programmer may choose whether hyphenated words are to be considered as a unit or as separate words. As in the previous SMART implementations, concept numbers of zero or concept numbers of 32000 or more are considered to be nonsignificant and are dropped from the vector.

Fig. 1 shows a typical output of LOOKUP. First the title is given, and then the text of the document (or query in this case). The resulting numeric concept vector is next printed, consisting of pairs of concept numbers followed by the respective concept weights (for example, concept 927 with weight 12, 2574 with weight 12, etc.). Concepts are listed in the vector in increasing numeric order.

[Fig. 1: Typical output of the LOOKUP program: title, document text, and the resulting concept vector.]

B) Document Clustering for Search Purposes

At present two clustering algorithms are in operation at Cornell: CLUSTR, which uses Rocchio's clustering algorithm [4], and DCLSTR, a variation of Doyle's clustering algorithm [5].

Rocchio's clustering algorithm is based on the following methodology: an unclustered document is selected as a possible cluster center. Then all of the other unclustered documents are correlated with it, and the document is subjected to a density test to see if a cluster should be formed around it. The density test specifies that at least N_1 documents should have correlations higher than a specified parameter rho_1 with the document in question, and that at least N_2 documents should have correlations higher than rho_2 (rho_2 is generally larger than rho_1). This test ensures that documents on the edge of large groups do not become cluster centers. If the document passes
the density test, thus becoming a cluster center, a cutoff correlation rho_min is determined from the cluster size limits and the distribution of correlation values. The cutoff correlation becomes rho_1 if fewer documents than the minimum cluster size (M_1) have correlations above rho_1. If more such documents exist, the cutoff correlation is chosen at the greatest correlation difference between adjacent documents, checking from the minimum cluster size M_1 up to the maximum cluster size M_2. A classification vector is then formed by taking the centroid of all the document vectors having correlations above rho_min. This centroid vector is matched against the entire collection, and the cutoff parameters for cluster size are recalculated to create an altered cluster. As a result of this process, some documents may appear in more than one cluster, and some which were in a cluster when the centroid was originally formed may not remain in any cluster. These documents, as well as those which failed the density test, are termed "loose", and those within the cluster are termed "clustered". This entire procedure is repeated with all unclustered documents, the first pass terminating when all items are either clustered or loose.

Figs. 2, 3, and 4 illustrate the formation of a cluster. Document 2 is first correlated with all previously unclustered documents in the collection (9 of the 82 documents in the collection had previously been clustered around document 1).
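The density test and the gap-based cutoff selection can be sketched as follows (a Python illustration, not the actual CLUSTR code; `corrs` is assumed to hold the candidate document's correlations with the unclustered documents, sorted in decreasing order, and the parameter names N_1, rho_1, N_2, rho_2, M_1, M_2 follow the text):

```python
# Illustrative sketch of Rocchio's density test and cutoff selection.
# `corrs` is the decreasing-order list of correlations of the candidate
# cluster center with every unclustered document.

def passes_density_test(corrs, n1, rho1, n2, rho2):
    """At least n1 correlations must exceed rho1 and n2 must exceed rho2."""
    return (sum(c > rho1 for c in corrs) >= n1 and
            sum(c > rho2 for c in corrs) >= n2)

def cutoff_correlation(corrs, m1, m2, rho1):
    """Choose rho_min at the greatest gap between adjacent ranked
    correlations, scanning from rank m1 (minimum cluster size) up to
    rank m2 (maximum cluster size); fall back to rho1 when fewer than
    m1 documents exceed rho1."""
    if sum(c > rho1 for c in corrs) < m1:
        return rho1
    best_gap, rho_min = -1.0, corrs[m1 - 1]
    for rank in range(m1, min(m2, len(corrs))):
        gap = corrs[rank - 1] - corrs[rank]   # gap between rank and rank+1
        if gap > best_gap:
            best_gap, rho_min = gap, corrs[rank - 1]
    return rho_min
```

On a ranked list whose largest gap after rank 5 falls between ranks 7 and 8, `cutoff_correlation` returns the correlation of the rank-7 document, mirroring the worked example that follows.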
The correlations are ranked, and the ranks, document numbers, and correlation coefficients are listed in Fig. 2.

[Fig. 2: The Testing of Document 2 as a Possible Cluster Center: the ranked correlations of document 2 with each previously unclustered document. Document 2 passed the density test, so the cutoff will be checked.]

[Fig. 3: Correlations of the centroid with the previously unclustered documents.]

[Fig. 4: The centroid of cluster 2, with its concepts and weights, and the documents included in the cluster.]

In the example, at least 10 documents (N_1) must have a correlation greater than 0.15 (rho_1), and at least 5 documents (N_2) must have a correlation greater than 0.25 (rho_2). The correlation of document 2 is larger than 0.15 for 19 other documents, and for 5 other documents the correlation exceeds 0.25. Document 2 therefore passes the density test.
M_1 in this example is 5, and rho_min is therefore calculated by finding the greatest correlation difference between adjacent documents, starting with the document of rank 5 (at least M_1 documents must be included) and checking differences up to M_2 documents (in this case 15 documents). The largest gap occurs between ranks 7 and 8; rho_min is therefore taken to be 0.2475.

The classification vector (called the centroid) is formed by merging the document vectors of documents having correlations above rho_min (0.2475). The centroid, composed of concepts and weights, is shown in Fig. 4. This centroid is then correlated with all previously unclustered documents (Fig. 3). A second cutoff correlation rho_min is calculated to determine which documents belong in cluster 2. Here the greatest correlation difference (starting at M_1 and checking until M_2) occurs between the documents ranked 11 and 12. Therefore rho_min becomes 0.3859, and the top 11 documents are included in cluster 2. These documents are listed as the "11 Relevant" in Fig. 4.

DCLSTR uses a variation of Doyle's algorithm [5]. The following description of the algorithm covers the main points. Assume that the document set is arbitrarily partitioned into m clusters, where S_j is the set of documents in cluster j. Associated with each set S_j is a corresponding concept vector C_j and frequency vector F_j. The concept vector consists of all the concepts occurring in the documents of S_j, and the frequency vector specifies the number of documents in S_j in which each concept occurs.

Every concept in C_j is assigned a rank according to its frequency; i.e., concepts with the highest frequency have a rank of 1, concepts with the next highest frequency receive a rank of 2, etc. Given an integer b (base value), every concept in C_j is assigned a rank value equal to the base value minus the rank of that concept. The vector of rank values is called the profile P_j of the set S_j. Fig.
5 illustrates the concept and frequency vectors, and the corresponding profiles, for a sample document collection.

Starting from a partition of the document set into m clusters, the profiles are generated as described. Every document d_i in the document space is now scored against each of the m profiles by a scoring function g, where g(d_i, P_j) equals the sum of the rank values of all the concepts from C_j which occur in d_i. Fig. 5 shows the results of scoring the documents in the sample collection against the profiles.

A new partition of the document set into m+1 clusters is then made by the following formula:

    S'_j = { d_i | g(d_i, P_j) > T_i },    1 <= j <= m

where

    T_i = H_i - [a(H_i - T)]    if H_i > T
    T_i = T                     otherwise

    H_i = max over j of g(d_i, P_j),    0 <= a <= 1,

and T is the given cutoff value. Those documents which do not fall into any of the m clusters S'_j are called loose documents, and they are assigned to a special class L, the (m+1)st cluster.

[Fig. 5: Construction of Profiles from Documents (base value = 6): (a) documents; (b) initial clusters, profiles, and frequencies. One Iteration of Doyle's Classification Algorithm (cutoff = 10): the profile of highest score for each document, and the resulting clusters.]

The process is now repeated after replacing S_j by S'_j and P_j by P'_j. The iteration continues until S'_j = S_j (actually, the termination condition is S'*_j = S*_j, where S*_j is the subset of S_j consisting of all those documents that score highest against profile P_j).
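One iteration of this procedure can be sketched as follows (a Python illustration, not the DCLSTR code itself; documents are represented as lists of concept numbers, concepts tied in frequency are assumed to share a rank, and the threshold comparison is taken as >= so that the top-scoring cluster always qualifies):

```python
from collections import Counter

# Illustrative sketch of one Doyle iteration: build profiles from the
# current partition, score every document against every profile, and
# repartition using the per-document threshold T_i.

def make_profile(cluster_docs, base):
    """Rank concepts by document frequency (ties share a rank) and
    assign each the rank value base - rank."""
    freq = Counter(c for doc in cluster_docs for c in set(doc))
    ranks, rank, last_f = {}, 0, None
    for concept, f in sorted(freq.items(), key=lambda kv: -kv[1]):
        if f != last_f:
            rank, last_f = rank + 1, f
        ranks[concept] = rank
    return {c: base - r for c, r in ranks.items()}

def score(doc, profile):
    """g(d_i, P_j): sum of the rank values of the document's concepts."""
    return sum(profile.get(c, 0) for c in set(doc))

def repartition(docs, profiles, a, T):
    """Place each document into every cluster whose score reaches the
    per-document threshold T_i; unplaced documents become loose."""
    clusters = [[] for _ in profiles]
    loose = []                                     # the special class L
    for doc in docs:
        scores = [score(doc, p) for p in profiles]
        H = max(scores)
        T_i = H - a * (H - T) if H > T else T
        placed = False
        for j, s in enumerate(scores):
            if s >= T_i:
                clusters[j].append(doc)
                placed = True
        if not placed:
            loose.append(doc)
    return clusters, loose
```

Iterating `make_profile` and `repartition` until the partition stops changing reproduces the loop described in the text.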
Basically, this algorithm matches documents to existing clusters by computing a document-cluster score for each document with respect to each cluster, and placing a document into those clusters for which a sufficiently high score is obtained. The clusters are then updated to include the new documents. In each iteration all the documents are correlated with all the clusters, and the clusters are updated until further updating does not alter the group of documents in each cluster. This updating is shown in list form in Figs. 6 and 7. The 12 profiles (clusters) of Fig. 6 are matched against the documents, and updated to become the profiles of Fig. 7.

It should be noted that the document clustering process can be extended to the clustering of clusters. That is, if one of the two clustering algorithms generates m groups of documents, these m groups could be grouped together, as if they were documents, into n clusters, where 1 <= n < m. These n clusters could then be grouped together, and so on, until a hierarchical cluster tree is formed as shown in Fig. 8. At present no routines for automatically constructing such multi-level cluster trees exist in the SMART system, although such an algorithm is planned for implementation in the near future. Both CLUSTR and DCLSTR generate the first level of the cluster trees, thus representing special cases of more general tree construction routines.

C) The Selection of Documents to be Searched

The search process consists of four steps. First a search query is
defined, either using the author's original query or a modification of the original query, in numeric concept vector form. One important modification consists in using documents judged relevant by the author to modify the original query vector. This process is known as relevance feedback and is discussed in part D of this section.

[Fig. 6: The 12 profiles (clusters) before updating.]

[Fig. 7: Updated Profiles (Clusters): the documents contained in each of the 12 profiles.]

[Fig. 8: A Hypothetical Cluster Tree: the root, centroids, and documents (d).]

Once a search query is defined, the set of documents to be correlated with the query is selected. SMART provides two options: either a full search or a tree search may be made. In a full search every document in
the retrieval base is correlated with the query. In this case the selection of the documents to be searched is trivial: all documents in the collection are searched.

In the tree search [6], a set of documents is selected by a cyclic process, using a tree such as that pictured in Fig. 9. At any one time, a set of active nodes exists in the tree; initially this is the set of roots (the highest level of clusters). Each node in the active set is compared with the search query. The "goodness" of each node is defined from the relatedness of the query to the node, and from other information about the structure of the tree; the nodes of the active set are then ordered by this goodness value. A subset of active nodes is selected as being most promising. The corresponding nodes are deleted from the active set, and the sons of these nodes (if centroids) are correlated with the query and become part of the active set. Those sons which represent documents are then entered onto a list of documents to be used in subsequent correlations with the query. The active set is cyclically reordered and another group of nodes is selected to have its sons examined, until some desired number of documents are located. The process used to obtain a list of specific documents to be directly compared to a query is represented in Fig. 9.

[Fig. 9: Searching a Hypothetical Tree: the tree at the start and after one, two, and three cycles. The goodness value of each centroid in the active set is indicated inside that node; a centroid selected for expansion is indicated by a doubled border; selected documents are marked.]

The listing reproduced in Fig. 10 shows an example of input to the cluster searching routine.
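The cyclic expansion of the active set can be sketched as follows (a Python illustration; the node layout as nested dicts is an assumption, and the plain cosine correlation stands in for the full "goodness" computation described later in this section):

```python
import heapq
import math

# Illustrative sketch of the cyclic tree search.  Centroid nodes are
# dicts with a "vector" (concept -> weight) and "sons"; document nodes
# carry a "doc" id.  Goodness here is simply the query-centroid cosine.

def cosine(u, v):
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def tree_search(query, roots, wanted, expand_per_cycle=1):
    """Cyclically expand the best active centroids until `wanted`
    documents have been located; returns the located document ids."""
    active = []                   # max-heap of (-goodness, tiebreak, node)
    counter = 0
    for node in roots:
        heapq.heappush(active, (-cosine(query, node["vector"]), counter, node))
        counter += 1
    located = []
    while active and len(located) < wanted:
        for _ in range(expand_per_cycle):        # the selected subset
            if not active:
                break
            _, _, node = heapq.heappop(active)   # best node leaves the set
            for son in node["sons"]:
                if "doc" in son:                 # document son: collect it
                    located.append(son["doc"])
                else:                            # centroid son: activate it
                    heapq.heappush(
                        active,
                        (-cosine(query, son["vector"]), counter, son))
                    counter += 1
    return located
```

The heap keeps the active set ordered by goodness, so each cycle pops the most promising centroids, adds their centroid sons back to the active set, and accumulates their document sons for later correlation with the query.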
The first iteration (Iteration 0) uses a full search instead of a tree search. The second iteration (Iteration 1) represents a tree search on the cluster collection "CENTROID NO MORE" using the "COSINE" correlation. The desired number of documents to be selected for correlation with the query is given by "WANTED", where "WANTED" is defined as:

    WANTED = "CORDOC" + "TIMALL" x "ALLOF"
           + "TIMREL" x (the number of relevant documents not yet retrieved)
           + "TIMNMR" x "NOMOR"

where

    CORDOC    at least "CORDOC" documents will be correlated in this
              iteration;
    TIMALL    "TIMALL" times "ALLOF" (for this iteration) documents are
              additionally correlated in this iteration;
    TIMREL    "TIMREL" times the number of relevant documents not yet
              retrieved are additionally correlated in this iteration;
    TIMNMR    "TIMNMR" times "NOMOR" (for this iteration) documents are
              additionally correlated in this iteration.

"ALLOF" and "NOMOR" are user-supplied constants indicating how many documents are used in relevance feedback. Therefore the second and fourth terms of the parameter "WANTED" are constants, like "CORDOC", for a given iteration.

[Fig. 10: Input parameters to the cluster searching routine.]

These constants are expressed by three parameters (rather than
The parameter "TIMREL" allows the number of documents searched to be related to the number of relevant documents previously not found. The "goodness" of each node (the parameter value used to rank the nodes) may also be controlled by the user through 17 parameters as follows. "PCN" "GOODNESS" = "COEF" + "MCN" x "CROWN" "PFCFCN" "PNCFCN" + "MCFCN" x "COEF" r r ^ r u i N x "CROWN" r m r u i N + n PLV n + "MLV" + "LEVEL" "MCNLV" x "CROWN" n P N C N L V ? f x t l p p p p ] " ytf rrL IijV "LEVEL"™^' tfpyppT ytt + "MCFLV" x "COEF" " x "LEVEL" + "MALL" x "COEF"" P A L L C F " x "CROWN"" P A L L C N " x "LEVEL"" P A L L L V " where "COEF" is the correlation value (usually cosine) between the node and the query, "CROWN" is the number of nodes that are the sons of the node, and "LEVEL" is the level of the node. For example, the node in Fig. 9 wi'ih a "goodness" of 0.9, has a "CROWN" of 11 and a "LEVEL" of 3. It should be noted that the formula for "GOODNESS" contains many combinations of "CROWN" and "LEVEL", making the formula extremely flexible for experimental purposes. It is expected that most users will use only two or three terms, most parameters in "GOODNESS" being usually set to zero. The size of the subset of active nodes to be expanded (after all active nodes are ranked by "goodness") is determined by additional parameters specified by the user, as printed in the listing (Fig. 10) under "SELECTION" and "REJECTION". 
    MINNOD,   At least "MINNOD" nodes and not more than "MAXNOD" nodes
    MAXNOD    are to be expanded for this iteration;
    GAP       If there exist two nodes between "MINNOD" and "MAXNOD"
              which have a goodness difference greater than "GAP", all
              nodes above that gap are expanded;
    EPSLON    Any nodes within "EPSLON" of the last node selected for
              expansion are also to be expanded;
    MNGOOD    Any node with a "GOODNESS" of less than "MNGOOD" is not
              to be retained for expansion;
    MNCORR    Any node with a correlation less than "MNCORR" with the
              query is not retained for expansion;
    PERCOL    Only nodes whose combined "CROWN" is greater than "PERCOL"
              percent of the size of the collection being searched need
              be retained for expansion;
    TIMWAN    Only nodes whose combined "CROWN" is greater than "TIMWAN"
              times the number of documents to be correlated are
              retained for expansion.

The selection of the documents using the parameters from Fig. 10 is shown in Figs. 11, 12, and 13. The queries are processed as a batch, and queries 31, 32, 33, and 34 are shown as examples. The queries are first matched against the "roots" of the centroid tree, consisting of centroids 1, 2, and 3. The results of the matching and other useful statistics are shown in Fig. 11. The query number, iteration number, number of documents wanted and found, centroid used to match, "goodness" of the matching, statistics of the centroid, and the cosine correlation are given for each query match against all the roots.

The "REJECTION" parameters are used here to eliminate centroids before any ranking is done on "goodness". A "MNGOOD" of 0.10 causes centroid 2 to be dropped from the active set of query 33, and centroid 3 to be dropped
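The text-analysis step described in the Introduction reduces each document and query to a weighted "concept" vector by means of dictionary and thesaurus lookup. A minimal sketch of that reduction, assuming a small hypothetical thesaurus (SMART's actual dictionaries are not reproduced here):

```python
# Illustrative sketch only: SMART's real dictionaries and suffix rules are not
# available here. A tiny hand-made "thesaurus" maps word stems to numeric
# concept classes; a text becomes a weighted concept vector by counting the
# concept classes of its words.

from collections import Counter

# Hypothetical thesaurus: word stem -> concept number (invented for this sketch)
THESAURUS = {
    "retriev": 101, "search": 101,     # retrieval-related terms share a concept
    "document": 205, "abstract": 205,
    "feedback": 310,
}

def to_concept_vector(text):
    """Return {concept: weight} for words whose stems appear in the thesaurus."""
    vector = Counter()
    for word in text.lower().split():
        for stem, concept in THESAURUS.items():
            if word.startswith(stem):
                vector[concept] += 1
                break
    return dict(vector)

print(to_concept_vector("Feedback methods for document retrieval"))
```

Words not covered by the thesaurus are simply dropped, which roughly mirrors the language-normalization role the text assigns to the dictionaries.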
[Fig. 18: sample search output listing]

The last section of Fig. 18 contains various statistics for the run. These are defined as follows:

DOC. CORR      The total number of document-query correlations performed in the given iteration.

CENT. CORR     The total number of centroid-query correlations performed in the given iteration.

DROP DOC       The number of documents with a query-document correlation of less than "MIN CORR".

CORR. RANK     The number of documents with a query-document correlation of greater than or equal to "MIN CORR".

OLD. DOC       The total number of documents previously seen by the user.

OLD RELDOC     The total number of relevant documents previously seen by the user.

NEW DOC        The total number of documents (relevant and nonrelevant) shown to the user in this iteration.

POS. FEED      The number of items in the definition of the query with a positive multiplier for feedback.

NEG FEED       The number of items in the definition of the query with a negative multiplier for feedback.

QUERY CORR     The correlation of the query used in the present iteration with the original user query.

REC. CEIL      The recall ceiling seen by the user.

The listing of the retrieved relevant documents completes the first iteration of the search. At this point the user makes relevance judgments or, alternatively, prejudged relevance decisions are registered, and a new search query is constructed using information about the retrieved documents. The user-supplied instructions specifying what information is to be used, and how the new query is to be constructed, are taken from the input parameters (shown in Fig. 16 in the second two lines under the options for SEARCH 1). The definitions of the parameters are as follows:

POS MULT       All weights of the relevant documents used in feedback are multiplied by this number.

NEG MULT       All weights of the nonrelevant documents used in feedback are multiplied by this number. To signify that negative feedback is not desired, "NEG MULT" is blank or zero.

POS RANK CUT   All relevant items with iteration ranks above "POS RANK CUT" according to the ordering of the previous iteration are used in defining the new query.
NEG RANK CUT   All nonrelevant items with iteration ranks above "NEG RANK CUT" according to the ordering of the previous iteration are used in defining the new query.

POS CORR CUT   All relevant items with a correlation above this value are also used. (This value must include a decimal point.) If "POS CORR CUT" is zero or blank, no relevant are selected due to this parameter.

NEG CORR CUT   All nonrelevant items with a correlation above this value are also used.

POS ATLEST     At least "POS ATLEST" relevant will be fed back (if they exist, i.e., if more remain to be found).

NEG ATLEST     At least "NEG ATLEST" nonrelevant will be fed back.

POS NOMOR      However, no more than "POS NOMOR" items will be searched to provide the "POS ATLEST" relevant documents.

NEG NOMOR      However, not more than "NEG NOMOR" nonrelevant will be used. Note that only documents scanned in an attempt to locate relevant documents for positive feedback are used in attempting to find nonrelevant for negative feedback.

UNLESS         Negative feedback is done except when "UNLESS" relevant documents are found. If "UNLESS" relevant are found, no negative feedback at all is done. To signify that no negative feedback is desired, "NEG MULT" should contain blanks or a zero. Should "UNLESS" be left blank or set to zero, negative feedback is attempted regardless of the number of relevant actually used in positive feedback.

STOPALL        "STOPALL" is set to "YES" if the user wishes to stop considering documents for feedback once all the relevant documents have been found. If set to "NO", documents will be considered until the specifications of the other feedback parameters have been satisfied. The default is "NO".

PREC CUTOFF    If the precision after "POS RANK CUT" documents is over "PREC CUTOFF", and if the precision after more items are judged drops below "PREC CUTOFF", the judging of documents ceases.
POEFIN         "SILENT" if search queries are not to be printed; "STANDARD" if search queries are to be printed; "DETAILS" if details of the search query definition process are to be printed (used only for debugging).

Using Iteration 2 as an example, the new search query Q_{i+1} is defined by the following equation:

    Q_{i+1} = (1)\,Q_i + (1)\sum_{j=1}^{n_r} (r_j)_i - (1)\sum_{j=1}^{n_s} (s_j)_i

where (r_j)_i designates the concepts and weights of relevant document r_j; (s_j)_i designates the concepts and weights of nonrelevant document s_j; Q_i is the previous query (for iteration i); and n_r and n_s are defined by the number of relevant documents retrieved and the number of nonrelevant documents retrieved, respectively.

In Iteration 2, n_r <= 5; therefore, only the top five documents are retrieved, and not all of them will be relevant (in most cases). If at least two relevant documents are retrieved among the top five documents, no negative feedback will be done (n_s = 0). If fewer than two relevant documents are found, any nonrelevant retrieved among the top two documents will be used for feedback (n_s <= 2). This condition is stipulated by an "UNLESS" of 2 and a "NEG RANK CUT" of 2.

The newly defined search queries are shown in Fig. 19. For query 1, one relevant and four nonrelevant documents are retrieved in the top 5. The relevant document (17) and the only nonrelevant in the top two retrieved (69) are used to construct the new query. Query 2 retrieves 1 relevant, but the top 2 retrieved are both nonrelevant, so both (27 and 33) are used, together with the one relevant found during feedback. Query 3 finds 3 relevant in the top five retrieved; hence no negative feedback is used. The new query vectors (Fig. 20) are used for searching, and the results are shown in Fig. 18, second iteration.

[Figs. 19 and 20: new feedback query listings]
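The unit-multiplier feedback equation above can be sketched on concept vectors represented as dictionaries. The document vectors below are hypothetical, not the actual vectors of Fig. 19:

```python
# Sketch of the feedback equation: Q_{i+1} = Q_i + sum of relevant vectors
# - sum of nonrelevant vectors, with the (1) multipliers generalized to the
# POS MULT / NEG MULT parameters. All vectors here are invented examples.

from collections import Counter

def feedback(query, relevant, nonrelevant, pos_mult=1, neg_mult=1):
    """Apply one round of relevance feedback to a {concept: weight} vector."""
    new_q = Counter(query)
    for r in relevant:                      # add weighted relevant vectors
        for concept, weight in r.items():
            new_q[concept] += pos_mult * weight
    for s in nonrelevant:                   # subtract weighted nonrelevant vectors
        for concept, weight in s.items():
            new_q[concept] -= neg_mult * weight
    return {c: w for c, w in new_q.items() if w != 0}

q0 = {101: 12, 205: 4}                      # hypothetical original query
doc_rel = {101: 6, 310: 2}                  # hypothetical relevant document
doc_nonrel = {205: 4, 400: 1}               # hypothetical nonrelevant document
print(feedback(q0, [doc_rel], [doc_nonrel]))
```

Note that negative feedback can leave a concept with a negative weight; whether such concepts are kept or dropped is a design choice not settled by this sketch.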
E) Search Evaluation

Several different evaluation measures are used in the SMART system, all based on the concepts of recall and precision. The definitions of these measures are the following:

    Recall = a/b        Precision = a/c

where

    a = the number of relevant documents retrieved,
    b = the number of relevant documents in the collection,
    c = the number of documents retrieved.

These measures are usually computed at a specified point during retrieval, usually either after a given number of documents have been retrieved, or after a given recall has been obtained.

Two types of averaging graphs, and four types of overall recall and precision averages, are generated by the SMART system and listed in Figs. 21, 22, 24, and 25. Fig. 21 shows one type of graph and all four overall averages. At the top of the listing the runs being evaluated are identified (in this case a full search run (run 0) and a centroid search run (run 1)). Below are listed the recall levels being used, and the precision achieved at each recall level. The number of queries used in the averaging at each point is also given. For example, at recall level 0.10, run 0 shows a precision of 0.4948, but for only 2 queries a relevant document had been retrieved at that recall level.

[Figs. 21 and 22: recall-level averages and recall-level graph]

For example, at a recall of 0.10 (10 percent of the relevant documents retrieved) run 0 shows a precision of 0.4948, and run 1 a precision of 0.4803. These precision values are the averages of the precision, at a given recall level, for all the queries searched.
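The recall and precision definitions above can be sketched directly; the ranked output and relevance judgments below are hypothetical:

```python
# Sketch of the definitions: a = relevant retrieved, b = relevant in the
# collection, c = documents retrieved. The ranked list and relevant set are
# invented for illustration.

def recall_precision(ranked_ids, relevant_ids, cutoff):
    """Recall and precision after `cutoff` documents have been retrieved."""
    retrieved = ranked_ids[:cutoff]
    a = sum(1 for doc in retrieved if doc in relevant_ids)
    b = len(relevant_ids)
    c = len(retrieved)
    return a / b, a / c

ranked = [17, 69, 27, 33, 52]     # hypothetical ranked search output
relevant = {17, 33, 88, 90}       # hypothetical relevant set (b = 4)
print(recall_precision(ranked, relevant, cutoff=5))
```

Evaluating this at several cutoffs gives the per-query values that the averaging graphs summarize over all queries.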
It should be noted that interpolation methods are needed to produce the averages, since all queries do not possess an exact precision value at each given recall level. The graph of Fig. 23 shows the necessary interpolation for a hypothetical query with four relevant items. The relevant documents are assumed to be retrieved with ranks of 4, 6, 12, and 20. Thus, at 25 percent recall, the precision is 0.25; at 50 percent recall, the precision is 0.33, and so on. However, these values actually correspond to the highest possible precision points, since they are calculated just after a relevant document is retrieved. In this example, after 3 documents are retrieved, the precision is 0; after 5 documents, the precision is 0.20, and so on. This range of precision for each recall level is indicated by the top and bottom points in Fig. 23 at 25%, 50%, 75%, and 100% recall. The solid sawtooth line connecting these points is not used for interpolation; it is intended to indicate the drop in precision between the actual recall levels for this query, as more nonrelevant documents are retrieved.

The interpolation method actually used by the SMART system is based on the dashed lines shown in Fig. 23, where a horizontal line is led leftward from each peak point of precision, up to a point where a higher point of precision is encountered. This new curve (the dashed line in Fig. 23) does not lie above the sawtooth curve at all points. When the precision drops from one recall level actually achieved to the next, an immediate drop in precision after the first point to the level of the next point is indicated.

[Fig. 23: An Illustration of the Interpolation Method Used by the "Neo-Cleverdon" Recall-Precision Averages]

For example, in Fig. 23 the precision value at 0.50 recall is 0.33; but at 0.55 recall, the interpolated value used for the new averages is 0.25 precision. When the precision rises from one recall level to the next, however, the first precision point actually achieved is ignored for purposes of interpolation. The achieved precision of 0.25 at 0.25 recall in the example of Fig. 23 is ignored, and an interpolated precision of 0.33 is used for the averages for all recall levels from 0 to 0.50.

The second kind of average graph also generated is shown in Figs. 24 and 25. In this graph, the recall and precision are recorded and averaged after the retrieval of a given number of documents. For example, after one document has been retrieved in run 0, the average recall (over all the queries) is 0.0903, and the average precision is 0.3714. The recall and precision are averaged for 24 different cutoff points, and for 6 different percentage points, such as after 10 percent of the collection has been retrieved, etc. Three other statistics (besides the recall and precision) are measured and listed for each cutoff point. The first (NR) is the number of relevant documents retrieved at the given cutoff, and the second (CNR) is the cumulative number of relevant documents retrieved by this point. Also listed is the number of queries used to obtain the average at each point. These values are included to aid in the proper evaluation of runs, since the document-level averages are not plotted at equal levels of recall for each query (as are the recall-level graphs).

[Figs. 24 and 25: Document-Level Averages]

The final part of the evaluation process consists of tests of the significance of the differences between runs. Three basic statistical tests, the sign test, the T-test, and the Wilcoxon signed rank test, are calculated for each pair of search runs. All three statistical tests indicate whether a given difference in two averages is likely to have occurred by chance.

A one-sided test is designed to compare a supposedly better sample B with a given standard sample A. Specifically, one proposes two hypotheses H0 and H1. H0 states that the two samples A and B are produced by the same distribution; H1 states that sample B is statistically better than sample A. H1 is accepted if it is unlikely, under H0, that a difference between samples as great as, or greater than, that observed would occur by chance.

A two-sided test similarly compares two samples under the same H0, but with the alternate hypothesis H1 being that samples A and B are from different distributions. Here again H1 is accepted if the probability, under H0, is low that a difference between the two samples as great as, or greater than, that observed would occur by chance.

The T-test assumes that the differences d_i between the two measures, a_i and b_i, are distributed normally. Explicitly, it is assumed that d has mean \bar{d} and standard deviation \sigma_d. Note that \bar{d} and \sigma_d are computable for any distribution, not only the normal one. In particular, it is known that many sets of differences are not normally distributed. (For further discussion of the T-test and sign test, see reference [7], page 12, also [3], chapter 8.)

The sign test assumes that a result is equally likely to favor either sample A or sample B. Thus, it measures the probability of a more extreme distribution favoring B, or favoring either A or B.

The Wilcoxon signed rank test postulates that a greater difference between paired samples is more significant, but only as the numbers affect the ranking of the differences.
For example, differences of -1, 2, -3, 4, and 20 are equivalent to differences of -1, 2, -3, 4, and 5, since only the rank of the ordered differences favoring a sample is important (not the actual values of the differences). The Wilcoxon test assumes that the two samples come from the same family of distributions, i.e., either two normal distributions, or two binomials, etc.

The three tests are performed for eleven points of the recall-level averages, and for the four overall measures of recall and precision of the document-level averages; in addition, the tests are also performed for the 17 cost statistics. The three listings for the three different test procedures (Figs. 26-28) cover only the first option (eleven points of the recall-level averages plus the four overall measures).

For the T-test (Fig. 26), the following values are given for each of the fifteen statistics: the mean and standard deviation of the statistic for each of the two searches (A and B); the mean and standard deviation of the differences between the statistics for A and B; and a value T, which is defined as

    T = (\bar{A} - \bar{B})\,\sqrt{N} / \sigma_d

where N is the number of degrees of freedom (which is one less than the number of queries being tested). The one-sided and two-sided probabilities (indicating whether a difference between the two samples as great as, or greater than, that observed would occur by chance) are also listed. Finally, the fifteen one-sided tests are statistically combined into a single measure, also listed.

[Figs. 26 and 27: T-test and sign test listings]

The sign test (Fig. 27) gives the number of queries favoring search A, favoring search B, and tied; the normal deviate ignoring ties (computed by using the binomial normal approximation); and the one-sided and two-sided
probabilities for the test ignoring ties. The normal deviate and the one-sided probability using ties (based on a method developed by Cathy May [8]) are also computed and listed. The one-sided tests are again statistically combined into overall figures.

The Wilcoxon signed rank test (Fig. 28) gives the sum of ranks favoring search A and favoring search B; the number of degrees of freedom (specifically, the number of untied pairs); the normal deviate (computed using the Wilcoxon-normal approximation); and the resulting one-sided and two-sided probabilities. A statistically combined significance value is also listed.

3. Access to the SMART System

The SMART system exists at Cornell as a private library system, located on a disk, which is accessible by reading in sets of control cards. When the SMART programs are loaded, a routine called EXEC receives control. This routine interrogates control cards in the data stream to ascertain which routines are desired and transfers control to those routines in the sequence requested.
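The control-card dispatch performed by EXEC can be sketched as a simple interpreter loop. The routine names are taken from the text; the handler bodies and card formats are placeholders, not the actual implementation:

```python
# Sketch of an EXEC-style dispatch loop: read each control card, collect its
# parameter cards, and transfer control to the named routine in sequence.
# Handler bodies are placeholders; only the routine names come from the report.

def clustr(params): return f"CLUSTR with {len(params)} parameter cards"
def search(params): return f"SEARCH with {len(params)} parameter cards"
def averag(params): return f"AVERAG with {len(params)} parameter cards"

ROUTINES = {"CLUSTR": clustr, "SEARCH": search, "AVERAG": averag}

def exec_deck(cards):
    """Interpret a deck: routine cards dispatch; '(...)' cards are parameters."""
    log, current, params = [], None, []
    for card in cards + ["STOP"]:           # STOP ends the job
        if card.startswith("("):            # parameter card for current routine
            params.append(card)
        else:                               # new routine card: run the previous one
            if current:
                log.append(ROUTINES[current](params))
            if card == "STOP":
                break
            current, params = card, []
    return log

deck = ["CLUSTR", "(parameters)", "SEARCH", "(parameters)", "(parameters)"]
print(exec_deck(deck))
```

This mirrors the deck structure shown below: each routine card is followed by the parameter cards that configure it, and routines run in the order the cards appear.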
A typical deck setup for the system is reproduced as follows:

    //JOB             initiates SMART routines
    //SMART

    CLUSTR            sets up document groups for a collection already on file
    (parameters)
    (parameters)
    (parameters)

    SEARCH            performs retrieval runs using methods called for by the parameter cards
    (parameters)
    (parameters)

    AVERAG            performs statistical averages for the previous search
    (parameters)
    (parameters)

    STOP              signals end of job

[Fig. 28: Wilcoxon signed rank test listing]

4. Basic SMART System Flowchart

The SMART routines fall into two categories: routines that can be called with control cards, and routines that can only be called by other routines. The latter set is interconnected by means of complex internal vectors, designed to make the most efficient use of in-core storage. A flowchart is produced in Figs. 29-33. The routines which can be called by control cards are enclosed by boxes.
    CRDCOL    adds a vector collection to the system or modifies an
              existing vector collection
    RESCOL    inputs the search results of a collection for averaging
    LSTCOL    lists a vector collection
    LSTITM    lists an item (such as a document or query)

SMART System Chart - Vector Handling Routines
Fig. 31

    AVERAG    computes different kinds of averages for search runs
    LSTAVE    lists and plots the graphs for the averages calculated
              by AVERAG
    VERIFY    runs three kinds of significance tests of pairs of
              averaged results

SMART System Chart - Averaging Routines
Fig. 32

[Fig. 33: SMART System Chart flowchart; text garbled in scan.]
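The significance testing done by VERIFY (the introduction names the sign test, the t test, and the Wilcoxon rank sum test) can be illustrated with the simplest of the three. The sketch below is a minimal Python implementation of a two-sided paired sign test only; the other two tests are omitted, and the per-query score lists in the usage example are invented.

```python
# Minimal sketch of a two-sided paired sign test, the simplest of the
# three significance tests VERIFY applies to pairs of averaged results.
# (Illustrative only; not the original SMART code.)

from math import comb

def sign_test(a, b):
    """Two-sided sign test on paired per-query scores; ties dropped.

    Counts how often method a beats method b and computes the
    two-sided binomial tail probability under the null p = 1/2.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    k = sum(d > 0 for d in diffs)
    tail = min(k, n - k)
    p = sum(comb(n, j) for j in range(tail + 1)) * 2 / 2 ** n
    return min(p, 1.0)   # the doubled tail can exceed 1 when k == n/2

# Hypothetical per-query precision figures for two search methods:
method_a = [0.42, 0.38, 0.55, 0.61, 0.47, 0.52, 0.40, 0.58, 0.49, 0.33]
method_b = [0.30, 0.35, 0.50, 0.44, 0.47, 0.41, 0.36, 0.51, 0.39, 0.37]
```

With nine wins for method a out of nine untied queries (one query is tied and dropped), the sign test reports a small p-value, suggesting the difference between the two averaged runs is unlikely to be chance.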



References

[1] E. Ide, R. Williamson, and D. Williamson, The Cornell Programs for Cluster Searching and Relevance Feedback, Information Storage and Retrieval, Report ISR-12 to the National Science Foundation, Section IV, Department of Computer Science, Cornell University, June 1967.

[2] D. Williamson, The Cornell Implementation of the SMART System, Information Storage and Retrieval, Report ISR-14 to the National Science Foundation, Section II, Department of Computer Science, Cornell University, October 1968.

[3] G. Salton, Automatic Information Organization and Retrieval, McGraw-Hill Book Co., New York, 1968, Chapter 3.

[4] J. J. Rocchio, Jr., Document Retrieval Systems - Optimization and Evaluation, Harvard Doctoral Thesis, Information Storage and Retrieval, Report ISR-10 to the National Science Foundation, Harvard Computation Laboratory, Cambridge, March 1966.

[5] R. T. Dattola, A Fast Algorithm for Automatic Classification, Information Storage and Retrieval, Report ISR-14 to the National Science Foundation, Section V, Department of Computer Science, October 1968.

[6] R. Williamson, Centroid Searching (Tree Searching), unpublished paper, May 1968.

[7] G. Salton and M. E. Lesk, Computer Evaluation of Indexing and Text Processing, Information Storage and Retrieval, Report ISR-12 to the National Science Foundation, Section III, Department of Computer Science, Cornell University, June 1967.

[8] Cathy May, Evaluation of Search Methods in an Information Retrieval System, Term Report written for Computer Science 435, Cornell University, May 1968.