XI-1 Interactive Search Strategies and Dynamic File Organization in Information Retrieval E. Ide and G. Salton Abstract A great deal of effort has been devoted in recent years to the evaluation of automatic or semi-automatic information retrieval systems. Recent evaluation results indicate that the search effectiveness presently achieved, or likely to be achievable in the foreseeable future, is much smaller than expected by a majority of the potential user population. Furthermore, theo- retical advances in language analysis and data organization promise only relatively modest future improvements. The most significant advances in retrieval effectiveness are likely to be obtained by adaptive interaction techniques that extract information from the user during the search process to improve the organization of the data space, thereby providing more effective search and retrieval operations. The various user feedback techniques described either modify the user queries in such a way as to bring these queries close:- to existing groups of relevant documents, or modify the document space to bring relevant documents closer to the corresponding search requests. 1. Retrieval System Performance Over the past few years, the design of improved information storage and retrieval systems has become important to an increasing segment of the technically trained population. As a result, considerable attention has been paid to the development of automatic or semi-automatic hardware and software XI-2 systems designed to store ever Increasing amounts of information, and to make the stored data available to selected user classes. As the interest has grown in the development of automatic information systems, procedures for evaluating the performance of information systems have also become increasingly important, since the large investments necessarily required in a mechanization of information handling procedures would not be justified without some assurance that the resulting systems could render reasonably effective services. Evaluation studies of information system performance are often carried out by choosing some subset of the information requests submitted to a given system, and identifying as "relevant" to each query a list of items that have been hand-selected by the user or by a subject expert* The effective- ness of the search and retrieval system is then measured by determining the extent to which the selected relevant Items have been retrieved and other items have been rejected in answer to the query sample. Two standard retrieval measures have been widely used to evaluate retrieval effectiveness: recall, defined as the proportion of relevant items actually retrieved; and precision, the proportion of retrieved items actually relevant. A perfect system, achieving both maximum recall and maximum In fact, recall precision, is not generally achieved in actual practice. Is found to vary inversely with precision, that is, as the recall of a system goes up because more relevant items are retrieved, precision goes down because more irrelevant items are also retrieved. Therefore, the user must choose between obtaining either high recall and low precision, or high precision and low recall. The average search results obtained in several recent retrieval evaluation studies vary between a recall of 0.1 at a high precision of 0.9, for XI-3 specific and narrow search statements, and a recall of 0.9 at a low precision of 0.2, when the search statement is interpreted broadly. [1,2,3] Operational systems normally compromise by operating in the middle ranges where neither the recall nor the precision are very low. In fact, the Medlars system of the National Library of Medicine is said to operate at an average recall of 0.58 and an average precision of 0.5Q, thus producing the correct retrieval of about sixty percent of what is wanted, while keeping the amount of useless material also retrieved to about fifty percent. [3] Two pragmatic approaches are being actively pursued in an attempt to improve the retrieval effectiveness of existing or proposed information systems. The first one consists in using more refined information analysis procedures designed to generate query and document identifiers more reflective of information content. For example, the experimental automatic SMART document retrieval system which provides fully automatic document and query analysis, includes procedures for automatic synonym recognition using stored dictionaries and thesauruses, for the assignment of phrase identifiers instead of simple terms, for the refinement or broadening of information identifiers using stored hierarchical subject arrangements, and for the use of statistical and syntactic language analysis methods. [4,5] The second, more recently used method of improving retrieval effectiveness utilizes automatic information displays during an on-line search procedure in an attempt to prod the user into submitting more viable search statements. Excerpts of stored dictionaries or term lists can be displayed, as well as term frequency information, lists of related words, and titles or abstracts of stored documents. [6,7,8] While both advanced language analysis methods and on-line interactive display techniques appear to improve retrieval effectiveness, the increment XI-4 of improvement generated is relatively small, being generally from five to fifteen percent. [2,7] It thus appears that by methods which are well under- stood and seem economically reasonable, recall and precision figures of 0.60to 0.65 are presently achievable at least in experimental environments., Whether more dramatic improvements may be expected in the future — for example by the use of more refined grammatical models such as transformational language analysis — remains to be seen. Some evidence exists to suggest that presently obtainable results are only about twenty-five percent lower than those produced by an "ideal" search system, where human subject experts conduct exhaustive manual searches through the complete stored collection. [9] Therefore, recall and precision results of about 0.75 may constitute an upper bound to the performance of both automatic and manual retrieval systems. Whether any automatic system can achieve such results depends to some extent on the ability of the system to adapt to the expectations of the particular user population being serviced. Heuristic methods for this purpose are described in the remainder of this study. 2. Request Space Modifications A) Relevance Feedback A principal technique for improving the performance of automatic information retrieval consists in using information supplied by the customer in order to alter the request to correspond to the user's need. Speci- fically, the query representation — consisting in many retrieval systems of weighted sets of terms or concepts — can be changed by adding or stressing concepts which appropriately identify the user l s information need and minimizing or even deleting concepts which are not representative of the user's XI-5 need. The altered query should then be more similar to the stored represen- tations of documents relevant to the user and less similar to the representations of nonrelevant documents. One way in which this can be accomplished is by performing an initial search of the collection, using the original query, and retrieving for the user's attention a small amount of output, consisting of some of the highest scoring documents (those most similar to the query). These documents are examined by the user who identifies each retrieved item as either relevant (reflective of his information needs) or irrelevant. The stored represen- tations of these judged documents are then used automatically to adjust the queries in such a way that terms present in the relevant documents are promoted, whereas terms occurring in documents designated as nonrelevant are demoted. In a somewhat simplified form, a typical query updating procedure is represented by the following equation: n r r. - 0 7s s. q. = q. + a J " —l+l —l k - —i ? . —i , 1=1 1=1 where q. n (l)* represents the updated query vector, r- is one of —i is one of n s n r q. the original query vector, ' and s. —i document representations identified as relevant, ^ ' nonrelevant documents. [10,11] Two major variants of the relevance feedback process described above are discussed in the following subsection. The simpler algorithm, called positive feedback, uses only the retrieved documents judged relevant t: alter ':In the experimental system discussed here, terms having negative weights are deleted from the query Cgiven zero weight). XI-6 the query (equation 1, 3 = 0 ) . The second variant uses both the relevant and nonrelevant documents retrieved to modify the query Cequation 1, Q > 0 ) . A study of the differences in performance between these two strategies reveals an important characteristic of the space of document representations, and leads to a proposal for several new techniques designed to improve retrieval in similar environments. B) Positive and Negative Strategies A typical positive query alteration process, where concepts may be added to the query but none are deleted is illustrated in the examples of Tables 1 and 2, An original query statement is given in Table 1, as well as Following the the analyzed query "vector" in terms of a weighted term list. addition of terms from document number 102, previously identified as relevant, the revised query vector retrieves two more relevant documents, numbers 80 and 81, with ranks 7 and 6, respectively Cfor retrieval purposes, documents are always ranked in decreasing order of similarity with the query). These two documents were originally assigned ranks 14 and 137 using the unaltered query vector. Table 2 shows a typical retrieval output list, giving the ranks of retrieved documents in decreasing correlation order with the query. vant document numbers are identified by T Rele- Rf. The identified relevant docu- ment number 94 (originally retrieved with rank 14) is first used to update the query. spectively. This pulls up relevant documents 90 and 95 to ranks 7 and 10 reWhen these two new documents are used in turn to update the query, additional relevant items are retrieved, until finally all five relevant documents are retrieved within the top twelve items following feedback run 3. A typical recall-precision graph for positive feedback is shown in XI-7 Vector Type Illustration Initial Query Q 146 What inform!ion is available for dynamic response of airplanes to gusts or blasts in subsonic regime airplane 12 gust 12 available 12 blast 12 regime 12 dynamic 12 response 12 Initial Query Vector information 12 subsonic 12 Relevant Document 102 retrieved with rank 2 Cpartial vector) gust 48 lift 48 oscillating 12 subsonic 12 available 12 penetration 12 response 24 airplane 12 gust 60 sudden 12 blast 12 lift 48 dynamic 12 Query Modified by Document 102 information 12 regime 12 oscillating 12 subsonic 24 penetration 12 sudden 12 Relevant Document 80 (improves from rank 14 to rank 7) Cpartial vector! Relevant Document 81 (improves from rank 137 to rank 61 Cpartial vector! gust 24 lift 72 response 36 penetration 12 sudden 12 lift 84 oscillating 12 sudden 12 Positive Feedback Illustration Table 1 XI-8 Query Q 147: Will forward or apex located controls be effective at low subsonic speeds. Feedback Iterations Doc 1 Doc 2 Doc 3 Doc Initial Rank 2 3 4 5 6 7 8 9 10 LI 12 13 14 L5 Lb L7 J. 8 L9 20 21 22 23 24 25 2b 27 28 29 30 31 22 23 34 35 36 37 28 39 40 41 42 43 44 76 Positive Feedback Strategy for Query Q 147 Showing Improvements in Relevant Document Ranks XI-9 Fig. 1, giving averages over 42 queries for initial runs and two feedback iterations. The curves closest to the upper right-hand corner of the graph (where recall and precision are equal to 1) represent improved perfo:rmance. It is seen that the updated queries produced by the feedback operations exhibit a precision average 10 to 20 percent better than the original- queries for all recall points. Although positive feedback is often successful, (for example for query 147 of Table 2 ) , it fails to aid the retrieval performance of some queries. This occurs notably when no relevant items are retrieved, or when Performance may be Improved the retrieved relevant items are dissimilar. even under these unfavorable conditions by a negative strategy that moves the queries away from those items specifically not wanted by the user. An Illustration of the potential usefulness of the negative feedback strategy is given in Table 3, showing positive and negative performance for query 3. Here the positive strategy produces no improvement on the first iteration, and then promotes relevant documents 57, 31, 4, 30 and 32. while demoting item 33 which goes down from rank 124 to 194. The negative strate- gy, on the other hand, retrieves documents 4, 57, 30, and 32 on the first iteration by moving away from the nonrelevant Initially retrieved (documents 179, 42, 112, 39 and 117). A thorough experimental comparison in a collection of 200 documents between positive and negative feedback strategies reveals the following differences in performance: [12] a) the overall average differences in performance measured by the changes in rank of all documents strongly favor negative feedback, as is seen in Fig. 2; XI-10 0 A o A A o —o ^ Initial Scorch First Feedback Run (all effects) Second Feedbock Run (all effects) PRECISION 1.0 8 6 .4 • .2 1 1 1 L ^ » - • RECALL 2 . 4 . 6 .8 10 . Positive Feedback Performance (200 documents — 42 queries) Fig. 1 XI-11 Rank Positive Strategy Iteration 2 Rank Negative Strategy Iteration 1 2 3 4 5 6 7 8 9 10. 11 12 13 14 13 16 20 23 23 124 194 1 2 3 4 3 6 7 8 9 10 11 12 13 14 13 2022 23 23 27 115 124 Example of Improvements Obtainable with Negative Feedback CQuery 3\ Table 3 IM- FIRST ITERATION • INITIAL SEARCH o NEGATIVE FEEDBACK SECOND ITERATION o POSITIVE FEEDBACK PRECISION ii 1.0 .8 .6 .4 .2 0 1 .2 1 .4 1 .6 1 .8 L-k 1.0 Comparison of Positive ond Negative Performance Fig. 2 Feedback XI-13 b) the overall average differences measured by rank changes of unretrieved documents only are not statistically significant; c) however, the variance in the performance is always greater for negative feedback than for positive, indicating that for some queries negative feedback is better and for other queries it Is worse than positive feedback; d) queries retrieving no relevant document in an initial search (which therefore cannot be updated on the first iteration by any positive strategy) are helped by the negative procedure; e) on the average, the performance of queries that do retrieve relevant items in the initial search is not hindered by the negative strategy; f) the negative strategy changes the query vector much more than the positive strategy ( t e average correlation between initial /h and updated queries is about 0.8 5 for the positive strategy, but only 0.50 for the negative strategy); g) a plot of the average recall and precision differences between positive and negative feedback strategies is shown in Fig. 3; the following distinctions of 200 documents: are apparent for the collection i) if recall and precision are measured after tie retrieval of about 15 documents, the negative strategy is better by about 5% in recall, and about 3% in precision; ii) after the retrieval of 20 to 30 items, the two strategies are about equal; iii) after ^0 retrieved Items, the positive strategy is better by about 10% in recall and 20% in precision. XI-14 RECALL DIFFERENCE o-o FIRST FEEDBACK RUN •--• SECOND FEEDBACK RUN + 0.4 ' X N. +0.2 0 -0.2 -0.4J-0.6 h '*v 12 15 \ \ -08h \ • • " -^fc-rfc—^—rb—r^* 18 ^ 20 60 100 140 DIFFERENCES NUMBER OF RETRIEVED DOCUMENTS o) RECALL PRECISION DIFFERENCE + 0.4 +-0.2 0 -0.2 -0.4 .o—**8%« lz ^^*"« •-—. NUMBER OF hr*- RETRIEVED 140 DOCUMENTS 15 •^-\v 18 x 20 60 100 b ) PRECISION DIFFERENCES Differences Between Negative and Positive Feedback (averages 2 0 0 documents, 4 2 queries) Fig. 3 XI-15 This indicates that negative feedback retrieves more relevant documents within the top 10% of the document collection than positive feedback, but that the relevant documents remaining in the lower 70% of the collection are assigned much lower ranks by the negative strategy than by the positive strategy. Thus, in general, the query produced by negative feedback is closer to some relevant documents and at the same time further from other relevant documents than the positive feedback query. The evidence summarized above supports the following conclusion concerning the vector space of document representations: the documents selected by the user as relevant to his query are often found in two or more distinct groups in the document vector space; and these groups are separated from one another by nonrelevant documents. For a significant number of queries, this separa- tion of relevant document groups effectively prevents the retrieval of some relevant documents by conventional feedback strategies. [12] Consider as an illustration the document space of Fig. 4 . - Here docu- ments and queries are shown by points in the plane, and the distance between two points represents closeness of the corresponding subject matter.* Each query is assumed to retrieve all documents lying in a sphere around the query. The positive feedback illustration of Fig. 4(a) shows that the original query, identified by a circled zero, retrieves two relevant documents, one to the right of the query, and one to the left, as well as six nonrelevant documents. In actual fact each document or query must be represented by a t-dimensional vector, where t is the number of distinct allowable identifying terms; the two dimensional picture of Fig. 4 is thus a simplified analogy of the actual t-dimensional space. XI-16 A D ® Relevant Document Nonrelevant Document Query on Iteration N A Total Retrieved: Relevant 4 Nonrelevant 12 a) Positive Feedback X^^-£oj: 0 ^ ( / 0\Q 0 \ 0\ [oj fo^» 2 0 0 A 0 I Hj^ 0 V A/ L J - Total Retrieved: Relevant 4 Nonrelevant 14 b) Positive Negative and Feedback Feedback Negative >Fig. 4 XI-17 The expanded query, represented by a circled 1, retrieves additional documents, including a relevant one located to the left of the original circle. The new relevant item is used in a second updating operation to pull the query over to the left. The final updated query, represented by a circled 2, retrieves the three relevant items located on the left side of the picture; at the same time, two of the three relevant items on the right side are unfortunately lost. The same document space and query are processed by a negative feedback strategy in Fig. 4(b). Here the three nonrelevant items located just left of center move the original query over to the right, away from the nonrelevant group. A new updating operation then moves the query further away from the original position in the general direction of the relevant document group on the right. The negative feedback strategy thus retrieves the rele- vant items on the right side of the picture, but "loses" two of the relevant ones on the left. This type of retrieval behavior is illustrated for the example of query 9 in Table 4, where the positive strategy moves the query away from relevant document 8 2 when relevant document 116 is used for feedback. The negative strategy retrieves document 82 by using nonrelevant documents for feedback, but simultaneously, the query is moved away from document 116. If the high retrieval performance sometimes achieved by human subject experts is to be duplicated in an automatic environment, new retrievaJ. strategies must be specifically designed to select separated groups of relevant documents. Each of the techniques proposed in the following sections exhibit some advantages over conventional retrieval methods in the type of document vector space depicted in Fig. 4. XI-18 Rank Positive Strategy Iteration 0 1 179 112 39 42 181 45 62 116R 97 188 31 57 117 2 25 82R a 2 116R 179 62 102 181 39 42 117 3 45 115 2 158 0 0 0 82R Rank Negative Strategy Iteration 0 1 25 71 41 64 3 85 88 23 101 17 82R 0 116R 0 Q 0 2 25 71 41 3 98 178 82R 160 64 101 0 0 0 0 ^ 116R 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 IS 33 54 179 112 39^ 42 181 45 62 116R 97 188 31 57 117 2 25 82R 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 33 179 112 39 42 181 45 62 116R 97 188 31 57 117 2 25 82R Positive and Negative Feedback Strategies Query 9 with Separate Relevant Document Clusters Table 4 XI-19 C) Selective Negative Feedback The discussion in the previous section indicates that the use of retrieved nonrelevant documents for feedback often further lowers the ranks assigned to low-ranking relevant documents. This suggests that a more se- lective process might be devised in applying the negative strategy in order to improve overall performance. Under the present procedure, all terms in- cluded in the identified set of nonrelevant documents are automatically deleted from the query or reduced in weight. This process may lead to the effective loss of important query terms, particularly terms which may have more than one meaning in the document collection. The illustration of Table 5, covering a query dealing with data sets, shows that the crucial term "data set" is eventually deleted from the query." Two selective negative procedures are proposed. The first one, illustrated in Table 6, consists in assigning negative weights to terms extracted from nonrelevant documents while leaving the original query terms unchanged. Thus, in the example, the term "data set" is still present in the final query, but the related terms derived from the nonrelevant document set which suggest "sets of data" are assigned negative weights. The other selective procedure, illustrated in Table 7, makes use of a synonym dictionary, or thesaurus Cor alternatively an associative indexing procedure) to provide for each term a set of related terms. These related dictionary terms are first added to the query statement, after which the terms obtained from the nonrelevant documents are subtracted out. In the *;:"Data set" is an ambiguous term denoting both a communications device Cthe meaning assumed in the query)> and a "set of data" Cthe meaning derived from the nonrelevant document set). XI-20 Type of Vector I Illustration Please give specification for all currently available data sets. available 12 access 48 current 12 data set 12 file 24 list 24 specification 12 structure 84 1 a) Original Query b) Initial Query Vector c) Sum of Retrieved Nonrelevant Documents Standard Negative Feedback Result data set 60 current 12 d) available 12 specification 12 Example of Inadequate Negative Feedback Table 5 Type of Vector Illustration a) Initial Query Vector available 12 access 48 access 48 access -48 file -24 J current 12 data set 12 file 24 list 24 specification 12 structure 84 b) Sum of Retrieved Nonrelevant Documents Negative Context Vector (query concepts deleted) Selective Negative Feedback Result data set 60 file 24 list 24 c) structure 84 current 12 data set 12 structure -84 1 ' d) available 12 list -24 Cb-cl i specification 12 Selective Negative Weighting Table 6 XI-21 Type of Vector a) Initial Query Vector available 12 current 12 Illustration data set 12 specification 12 b) Concepts Related to "data set" with Correlation Strength structure (79), access (77), interface (58), line (52), file (50), sort (50), retrieval C49), list (47), transmission (30), band-width (28) access 24 file 24 interface 24 line 24 structure 24 c) Related Concept Vector (top 5 concepts with weight of 24) Query Vector with Related Concepts d) access 24 file 24 available 12 interface 24 current 12 line 24 data set 12 specification 12 structure 24 e) Sum of Retrieved Nonrelevant Documents Feedback Result with Related Concepts access 48 data set 60 current 12 file 24 list 24 structure 84 line 24 f) available 12 interface 24 specification 12 g) Related Concepts and Selective Negative Weighting access -24 available 12 line 24 current 12 list -24 data set 12 interface 24 1 structure -60 specification 12 Negative Feedback with Related Concepts Table 7 XI-22 example of Table 7, the thesaurus provides contextual information for the term "data set" used both in the sense of a communications device and in the sense of "sets of data"; the latter context in then eliminated by the negative feedback operation. Both of the suggested selective negative feedback strategies are intended to retain in the query the terms that might lead to the eventual retrieval of relevant documents, separated from the query by nonrelevant documents. Since the intervening nonrelevant documents are also retrieved, it remains to be seen whether these strategies improve performance for a significant number of queries. 3. Document Clustering When relevant documents are separated from each other by nonrele- vant documents, no conceivable strategy which uses a single query to search the complete document space can identify the separate sets of relevant items, while properly rejecting the nonrelevant documents located between them. A multiple query set might then be used, instead of a single query, in such a way that each "subquery" searches a distinct part of the document space. This, in turn, suggests that the documents in a collection be grouped into "clusters" of similar documents, and that each document cluster be searched separately. It may then be easier to discriminate between relevant and nonrelevant items within a given document cluster than in the document collection as a whole. Several methods exist for automatically producing document clusters in such a way that items sufficiently similar to each other are placed in the same group. [13,14,15] Such clustered document colllections can con- XI-23 veniently be used in a retrieval environment to reduce the search to a small portion of the document space by comparing the query against only those documents located within a specified subset of clusters. [16,17] Cluster searching can be performed in several distinct ways. The combined cluster search of Fig. 5(a) operates in such a way that all documents in the cluster set to be searched are ranked according to their distance from the query. Thus, the initial query of Fig. 5Ca) first retrieves six documents all located in the left-hand cluster, including one relevant item; a second search operation is then used to retrieve 13 more items. Alternatively, a separate cluster search can be performed, as shown in the example of Fig. 5Cb), where the documents are ranked separately within each cluster relative to other documents in the same cluster. The query then In the retrieves the highest ranking documents from each cluster searched. illustration the six relevant items are more efficiently retrieved in the separate cluster search than in the combined search, since the number of unwanted items obtained is only ten for the separate compared with thirteen for the combined strategy. The cluster searches shown in Fig. 5 compare all documents in all selected clusters with the same initial query. In order to generate a dis- tinct query for each cluster to be searched, it is possible to combine the notion of the cluster search with the query alteration methods used in relevance feedback. Specifically, a query alteration procedure can be utilized in which retrieved documents from separate clusters generate distinct queries, each of which operates within a distinct document cluster. The cluster feed- back process illustrated in Fig. 6Ca) is a partial search method of this type, XI-24 A • O <> 8 Relevant Document Nonrelevant Document Query Vector Document Cluster Centroid Document Cluster Second Search First Search -Document Cluster Totol Retrieved: Relevant 6 Nonrelevant 13 a) Combined Cluster Search Document Cluster Second Search First Search -Document Cluster Total Retrieved: Relevant 6 Nonrelevant 10 b) Seporate Cluster Search Single Query Cluster Fig. 5 Searches XI-25 A Relevant Document D Nonrelevant Document ® Cluster Centrold ® Query on Iteration N Total Retrieved: Relevant 6 Nonrelevant 9 a) Cluster Feedback m b) Split Queries Multiple Query Searches Fig. 6 Total Retrieved: Relevant 6 Nonrelevant 12 XI-26 The following principal steps are required: a) the original query (.designated in Fig. 6 by a circled zero) is first compared with the centers (centroids) of all document clusters; b) the clusters whose centroids are closest to the original query are then selected, and the individual document vectors within the selected clusters are compared to the query; c) relevance judgments are obtained for those documents found to be closest to the query; d) a new query is constructed for each cluster, using the original query as well as relevant Cor nonrelevant) documents from that particular cluster only — in the example of Fig. 6Ca) the original query (.circled 0) leads to two distinct new queries (circled 1) obtained by using the relevant documents from the right-hand and from the left-hand cluster, respectively; e) each new query is now matched only against the documents in its own cluster, and only the documents retrieved by a given query are used to modify that query in further feedback iterations; f ) * all documents retrieved from all selected clusters may be used to generate from the initial query a new centroid search query to select additional clusters to be searched; g);t; since more than one query is generated, some means of discarding queries that seem unlikely to retrieve additional relevant items would be desirable. Several possible criteria for eliminating such queries are suggested elsewhere [12] In the illustration of Fig. 6Ca), only nine nonrelevant items are retrieved together with the six relevant, -Steps f and g are not illustrated in Fig. 6 Gal. XI-27 The cluster feedback algorithm described above is equally feasible in combination with a technique called request clustering. This suggested alternative to document clustering assumes that documents formerly retrieved in answer to similar previous queries should be considered in processing a new query. In step a) the request cluster feedback algorithm would compare the new query to the centroids of clusters of previous queries submitted to the systems. The clusters of documents searched in steps b ) , cX, d ) , and e) would then include documents judged relevant to the queries in the query clusters nearest the new query. Request clustering allows documents which are adjacent in the document space to be placed into different clusters and nonadjacent documents to be placed into the same cluster. This may turn out to be advantageous in an environment containing separated groups of relevant documents. If the cluster search is to operate successfully, the retrieval problem (that is, the separation of relevant from nonrelevant) within each cluster must be simpler than the problem in the space as a whole; furthermore, the cluster selection method must pick few unproductive clusters to be searched. Should separated clusters of relevant documents still occur within one or more of the clusters, it may be necessary to construct multiple queries all of which search the same set of documents. A "query splitting" process designed to do this has been investigated with some success on a small test collection. [18] A query is split into two subqueries whenever the correlation between two relevant documents previously retrieved is small compared to the average inter-document correlation between the first five retrieved documents. An alternative strategy might be to split the query whenever a retrieved nonrelevant document is lo- XI-28 cated between two retrieved relevant ones; that is, relevant documents and v r are used to generate distinct (-split) queries whenever, for some nonn relevant item correlation (n^vO > correlation (r,v), and correlation Cn,r) > correlation (r,v). An illustration of the query splitting concept is shown in Fig. 6(b)., The original query (.circled Q\ first retrieves two relevant Items, one to the right and one to the left, whose interdocument correlation is small compared with the correlation of each relevant item to one of the nonrelevant in the middle. This leads to a split of the initial query into two pieces (.circled 1 ) , The sub- and to two additional queries (circled 2) after one more iteration. queries on the right retrieve the right-most relevant, and the left subqueries handle the relevant on the left. Both of the multiple query strategies illustrated in Fig. 6 remain to be tried out in a realistic document environment. 4. Document Space Modification The feedback procedures described up to now all produce a modifica- tion of the query space in such a way that queries are moved close to certain identified relevant documents, or away from identified nonrelevant ones. The stra- tegies suggested in this section attach the problem directly by permanently changing the document vector space. Specifically, the vector representations of docuThis ments judged relevant to a query are moved closer to the query vector. strategy is more radical than query modification, since it implies that the queries are more fundamental as subject indicators than the original docu- XI-29 ment identifying terms. Two different document space modification methods are illustrated in Fig. 7. In the first one, (Fig. 7(a)), the previously identified rele- vant documents are modified by addition of query terms as follows: d. , = Cl-a) d. + an t2) where d., - is the modified document, , the original query. d. the original document, and q A test was performed on this document modification pro- cess using a collection of 425 documents in aerodynamics, and a set cf 125 queries to effect the space modification. A new set of 30 additional queries not previously used for space modification was then processed with the modified document space, and improvements in both recall and precision of 30 to 15 percent were detected, compared with the use of these same queries in conjunction with the original, unmodified document space. These relatively large improvements appear to indicate that new customers whose relevance criteria play no part in the space modification profit directly from the query-document associations derived from previous system users. A second document space modification procedure illustrated ir. Fig. 7(b) is based on strategies previously tried in adaptive pattern recognition. [12] The basic idea is to pair each relevant document retrieved with a nonrelevant document not previously modified; if the nonrelevant item happens to be located closer to the query than the relevant one, an interchange procedure is used to move the relevant forward (closer to the query) and the nonrelevant backward (away from the query). process is as follows: More formally, the a) if for all d., such that dL is relevant to query q , XI-30 A D Relevant Document Nonrelevant Document <§) Query m m ® 0 m0 0 £2*^V u ^-A U 0 ^A 0 0 0 0 0 0 a) Relevant Document Modification b) Adaptive Document Modification Document Space Modification Fig. 7 XI-31 and for all d. such that d. is nonrelevant correlation CcL , q ) > correlation ( _ ,q ) + 0 d. _ no adjustment is made; b) otherwise, each vector cL denoting relevant document i is processed in order with q ; if there is a document k, not yet adjusted by a , and d, is not relevant to q and correlation Cd ,q X t 0 > correlation Cd.,q \ then d\ = (1 - a) cL + aa *k = U " a) ^ " «% where c is that previously unmodified nonrelevant item L having the highest correlation with q , and d! and d T , are the new adjusted document vectors; c) if no nonrelevant document k exists which has not been previously adjusted, the modification of the relevant item d. is still performed. This procedure is intended to produce a document space which groups all the relevant items around the corresponding queries, while moving the nonrelevant items further away. The space alteration is moreover controlled in the sense that a different nonrelevant item is subtracted out each time. The basic differences between the two suggested modification procedures is similar to the distinction between positive and negative feedback. The first technique adjusts only relevant documents, while the second alters both relevant and nonrelevant documents. A comparison of the two strategies XI-32 in the 425 document collection shows the superiority of the second method when a (modifier in equation C2)) is relatively small C-05 to .10). The ! advantage in precision of negative modification1 over ! positive modifica- tion1 is greatest at relatively low recall levels, reaching 4% at 20% recall. Both document space modification algorithms can easily be combined with the relevance feedback methods in an operating retrieval system to provide a continual adjustment of the document identifiers in accordance with the user's expectations. only the retrieved documents. The simplest procedure consists in modifying This modification could take place after the Only the vector representa- relevance judgments are rendered by the user. tion of the user's initial query would be used to alter the document representations. The proposed combined query and document space modification has not yet been tested in a retrieval environment. 5. Conclusion Several search and retrieval strategies are described in this study that use feedback information supplied by the user during the retrieval process to modify the query or document spaces. In each case, the space modifi- cation is intended to increase the correlation between queries and relevant documents, while decreasing the query correlation with nonrelevant items. Experimental evidence indicates that the improvements in retrieval effectiveness obtainable with these heuristic search strategies are much larger than the improvements immediately derivable from the more formal deterministic methods based on better document and query analyses and more sophisticated linguistic normalization tools. XI-33 References [I] C. W. Cleverdon and E. M. Keen, Factors Determining the Performance of Indexing Systems, Vol. 2 - Test Results, Aslib-Cranfield Research Project, Cranfield College of Aeronautics, 1966. G. Salton and M. E. Lesk, Computer Evaluation of Indexing and Text Processing, Journal of the ACM, Vol. 15, No. 1, January 1968. F. W. Lancaster, Evaluating the Operating Efficiency of Medlars, Final Report, National Library of Medicine, January 1968. G. Salton, Automatic Information Organization and Retrieval, McGraw Hill Book Co., New York, 1968. G. Salton and M. E., Lesk, The SMART Automatic Document Retrieval System — A n Illustration, Communications of the ACM, Vol. 8, No. 6, June 1965. J , L. Bennett, On-line Computer Aids for the Indexer, Proceedings . of the 1968 meeting of the American Society for Information Science, Columbus, Ohio, October 1968. M. E. Lesk and G. Salton, Interactive Search and Retrieval Methods Using Automatic Information Displays, Scientific Report No. ISR-14 to the National Science Foundation, Section IX, Department of Computer Science, Cornell University, October 1968, to be presented at the 1969 AFIPS Spring Joint Computer Conference, May 1969. H. Borko, Utilization of On-line Interactive Displays, in Information Systems Science and Technology, D. Walker, editor, Thompson Book Co., Washington, D. C , 1967, M. E. Lesk and G. Salton, Relevance Assessments and Retrieval System Evaluation, Scientific Report No. ISR-14 to the National Science Foundation, Section III, Department of Computer Science, Cornell University, October 1968. J. J. Rocchio and G. Salton, Information Search Optimization and Interactive Retrieval Techniques. Proceedings of the AFIPS Fall Joint Computer Conference, Spartan Book Co., 1965. G. Salton, Search and Retrieval Experiments in Real-Time Information Retrieval, Proceedings IFIP Congress 68, Edinburgh, 1968. Eleanor Ide, Relevance Feedback In an Automatic Document Retrieval System, Master1s Thesis, Cornell University, Report No. ISR-15 to the National Science Foundation, Department of Computer Science, Cornell University, January 1969. [2] [3] [4) [5] [6J [7J [8] [9] [10] [II] [12] References (contd) [13] R. E. Bonner, On Some Clustering Techniques, IBM Journal of Research and Development, Vol. 8, No. 1, January 1964. H. Borko and M. D. Bernick, Automatic Document Classification, Journal of the ACM, Vol. 10, No. 2, April 1963. R. M* Needhara and K^ Sparck Jones, Keywords and Clumps, Journal of Documentation, Vol. 20, No. 1, March 1964, J. J. Rocchio, Jr., Document Retrieval Systems —Optimization and Evaluation, Harvard University Doctoral Thesis, Report No. ISR-10 to the National Science Foundation, Harvard Computation Laboratory, March 1966. G» Salton, Search. Strategy and the Optimization of Retrieval Effectiveness, Proceedings of the FID/IFIP Conference on Mechanized Information Storage, Retrieval and Dissemination, North Holland Publishing Co., 1968. A. Borodin, L* Kerr and F* Lewis, Query Splitting in Relevance Feedback Systems, Scientific Report No. ISR-14- to the National Science Foundation, Section XII, Department of Computer Science, Cornell University, October 1968. T. L. Brauen, R. C. Holt, and T. R. Wilcox, Document Indexing Based on Relevance Feedback, Scientific Report No. ISR-14 to the National Science Foundation, Section XI, Department of Computer Science, Cornell University, October 1968. [14] {IS] [16J [17] [18J [19]