X-l

X,

A Relevance Feedback System Employing a Dynamically Evolving Document Space M. C. Davis, M. D. Linsky, M. V. Zelkowitz

Abstract Methods for improving precision and recall in information retrieval have been based mainly on query modification or temporary document transformations. The present study investigates results obtained

when, in addition to modifying the query, the document collection is considered as a dynamically changing space which is continually improved to reflect, more accurately, the contents of the documents it contains. This

alteration is achieved by reclustering the documents based on relevance feedback, so that future queries can benefit from the results of processing previous queries.

1.

Introduction Information retrieval is fundamentally concerned with the selective

retrieval of information which is pertinent to an inquiry from a large source of data. A comprehensive manual search covering even a small portion of

available information is clearly impossible when dealing with a large library containing several million volumes. Current card catalogue oriented systems

have proved to be useful tools towards the realization of more efficient, exhaustive scanning of information files, but intrinsic difficulties resulting from requirements imposed upon such systems by the alphabetical nature of these files either renders them unwieldy or incomplete. Innovations

X-2

in the computing field have led to the notion that the retrieval problem can pragmatically be coped with only by using computing devices programmed to simulate personal inspection of possible relevant information. SMART document retrieval programs are designed to accomplish such a simulation. Basic SMART retrieval involves two major procedures. Initially, The

syntactic and semantic considerations are employed to automatically construct a concept vector with numerical components which will essentially act as the query itself when future reference is made to that query. When a query

is processed, its concept vector is compared with the concept vectors of all of the documents in the collection (these document vectors are derived in a manner analogous to that for the concept vector for queries) and the cosine correlation, a measure of the similarity among queries and documents, is obtained. It is assumed that the probability of relevance

of a given document to the query at hand is greatest for those documents whose correlation with this query is highest. Thus, the user will be

presented with identification numbers of the documents with the highest correlations. Due to the syntactic and semantic impreciseness of the English language as well as the user's possible uncertainty pertaining to the exact information which he is seeking, standard means of condensing or reducing documents by automatic procedures such as, for example, statistical term associations and frequency counts of particular words and phrases, are not definitive enough to produce a space which is an exact image of the original documents and queries. Thus, it has been hypothesized that systematic

X-3

alteration of these indexing products based on user relevance feedback, can be employed to rectify the effects of any misinterpretations of user intent or document emphasis, to produce an "improved" or "refined" document space. For instance, studies have been made to evaluate systems

which require that the user return judgments indicating which of the retrieved documents are of value to him. Based on these personal relevance

judgments, the system then processes a modified query which reflects the feedback indications. That is, more emphasis is placed on documents

which bear a marked similarity to the documents previously found to be relevant. The expected improved results, which have been demonstrated

in [1], indicate that more relevant documents can thus be retrieved. There exists a mathematical justification to support the expectation that such query modification will result in more effective retrieval. Since documents and queries are vectors with numerical SMART

components, they can be considered as points in a vector space.

normally retrieves all documents in the vector space which lie "close" to the query (see Fig. 1(a)). Hopefully, modifications based on relevance

feedback can be used to move the query to a new position in the space. Ideally, a greater density of relevant documents will be centered about this portion of the space. Though such a query modification does rectify to some extent the imperfections in the concept vectors corresponding to the user's queries, it has no effect on the document space itself. Any inadequacies in the

original document space will exist throughout the life of the collection. It is, therefore, contended that the document space must also be altered if optimal results are to be obtained.

4

a) Typical SMART Retrieval

b) Typical SMART Retrieval with Relevance Feedback

c) Typical Retrieval from Modified Document Space

d) Typical Retrieval from Modified Document Space with Relevance Feedback

Retrieval Illustrations
Fig. 1

X-5

Implementation of a SMART-like system with the inclusion of the above query alteration technique will be notably unsuccessful in handling situations where relevant documents are clustered about distinct points which are distant from one another as in the illustration of Fig. 1. The query, when altered, will be moved to a position which

is close to or within one cluster, but far from the second cluster of relevant documents. totally ignored. This second group of documents will, therefore, be

The assumption that distinct documents which are found

to be relevant to a given inquiry are in fact interrelated, leads to the contention that a reclustering of the document space based on relevance feedback data, is highly desirable. If this is done, determination of a

single relevant document will easily facilitate the recovery of others. Whereas query modification involves a temporary change,since it is unlikely that a given query will exactly duplicate another one submitted at a future date, the proposed document space revisions are permanent in the sense that each updated document space is a refined version of the current space, not the original one determined by SMART. This report is concerned with the reclustering of documents within a dynamic document space and the corresponding effects of these modifications on retrieval results. Control cases are examined to provide a basis for the

evaluation of the effectiveness of the proposed method.

2.

Proposed Study The concept of a dynamic document space is not in itself novel.

The work of Friedman, Maceyak, and Weiss also involved the clustering of relevant documents [1]. However, unlike the presently proposed scheme which

X-6

retains the effects of the document space modifications during the processing of future queries, the document space treated in [1] reverts to its original form before the processing of the next query. With the proposed system, the methods which control the successive alterations of the document space are based upon the following assumptions:

a)

For a given query, concepts which appear more frequently in relevant documents than in nonrelevant documents probably contribute significantly to the relevance of the pertinent documents. The significant concepts are related to one Thus,

another and often occur in conjunction with one another. by raising the weights of these concepts in all documents within the entire space which contain occurrences of these concepts, similar documents are brought closer together; b)

Any relevant document (as determined by user feedback) which does not contain an instance of a given concept determined to be significant is likely to contain material which nonetheless relates to this concept. that relevant document. Therefore, this concept is added to It is expected that by increasing the

weights of these concepts, more relevant documents will be clustered together and ultimately retrieved, when a similar query is processed in the future.

It is difficult to determine an adequate criterion for deciding which concepts are, in fact, significant to the relevance of a particular document. A discrimination factor, d . , can be calculated from the quantities . n., where d., r , and n., are defined by equations (1), (2), and (3).
I

r.

and

i

i

i

r, a 7 y 1=

c u A

*keR ****• ' L

;

I = no. of elements e R

(1).

X-7

n. = - ) r
i J /
J

c

k,i

;

J = no, of elements e N

(2)

keN

d. = tr. - n. )/(r. + n. )
1 1 1 1 1

(3)

c R N

. is the weight of concept i in the kth document. is the set of relevant retrieved documents. is the set of nonrelevant retrieved documents• Thus r. is the average weight of concept i in the retrieved relevant

K ,1

documents; n.

is the average weight of concept i in the retrieved nonThe difference, r. - n, , if positive, is then a measure

relevant documents.

of how much more important the ith concept is in describing the nature of the relevant documents than in describing the nature of the nonrelevant documents. This measure, when normalized by dividing by the factor, r. + n,, A positive value for d.
l

becomes the desired discrimination factor, d..
l

indicates that the concept occurs more frequently in the retrieved relevant documents than in the retrieved nonrelevant and therefore is of some significance. Clearly, the larger the value of d., the more significant A concept is deemed

the concept is as an indicator of document relevance. "significant" if and only if d, > 6

(4)

where 6 is an appropriately chosen constant which specifies the minimum value of d., which demonstrates the "importance" of the concept. A reasonable approach to determining the proper magnitudes of the ensuing alterations is to define the increment as a function of d.. All

X-8

documents within the entire document space are then modified by the formula:

c

= c

. (1+yd.) for appropriately chosen y.

(5)

It is evident that if c affect its value.

, is originally 0, equation (5) will not

K,1

However, consistent with assumption b) above, for

documents deemed relevant, the absence of concept i will result in an alteration specified by equation (6) as follows: c, . « e for K e R
k,i

(6)

Since concept weights can never decrease with this scheme, they would grow unmanageably large over a long period of time if no provisions were made to check this growth. Consequently, the documents are all normalized This normalization process serves an

to a Euclidean length of 1000. additional purpose.

Concepts which are never significant, i.e., have

corresponding d.'s which are always negative or negligible positive quantities, are reduced in magnitude due to the increase in the weights of the relevant concepts. In a sense, negative feedback is thereby achieved.

The retrieval process can now be specified by the following algorithm:

a)

Retrieve the top 15 documents (based upon the cosine correlation with the query).

b)

Obtain relevance feedback judgments from the user concerning these 15 retrieved documents (Our experiments relied on a priori knowledge of the relevant documents to simulate this feedback procedure.)

X-9

c) d) e)

Compute r. , n., and d., from (1), (2), and (3). Set d. = 0 if d. < 0
l l

(7)

Process the collection and perform the transformation specified by (5).

f)

Repeat step a) with the modified collection and the same or different query depending on input specification.

3.

Experimental Results The basis experiment consists in applying algorithm (7),

programmed on the IBM 360/65, to the Cranfiald collection of 200 documents and 42 queries. In performing the experiments summarized below, two general conditions may obtain during the alteration of the document collection:

a)

The queries in each group have similar sets of relevant documents, (illustrated in Fig. 4 ) ;

b)

The queries in each group have different sets of relevant documents, (illustrated in Figs. 2 and 3 ) .

In almost all instances, several queries are "batch processed," before the collection is refined. Condition b) is probably more repre-

sentative of a real situation since many queries would normally enter a retrieval system before the collection is updated. The recall and precision indicated in Figs. 2 through 5 are typical of the results obtainable with the proposed system. The collection

was modified on the basis of prior searches using query 34 for Fig. 2, queries 12, 15, 16, 17, 38, and 41 for Fig. 3^ and queries 7, 15, and 17

X-10

; Rank i 1 1 2

Document 104 102 R 199 96 18 191 200 99 193 R 109 83 R 98 91 90 64

Correlation .2307 .2091 .1501 .1484 .1451 .1409 .1407 .1394 .1304 .1223 .1169 .1146 .1114 .1102 .0960

Rank 1 2

Document 102 R 104 199 18 191 193 R 96 83 R 200 99 109 64 90 91 98

Correlation .2273 .2262 .1447 .1420 .1409 .1398 .1363 .1349 .1328 .1264 .1192 .1124 .1095 .1091 .1064

3

I

4
5
6 7 8 9 10 11 12 13 14 15

1 34
5 6 7 8 9 10 11 12 13 14 15

1

a)

SMART Standard Retrieval

b)

Iteration Using Query 34 y a o.l, 6 = 0.9, e = 0

Rank

Document 102 R 104 193 R 199 18 191 96 83 R 200 99 109 64 91 90 98

Correlation .2290 .2276 .1543 .1444 .1440 .1418 .1359 .1339 .1313 .1282 .1182 .1165 .1091 .1083 .1057 |

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Document 102 R 104 193 R 83 R 191 199 18 96 200 64 99 109 90 91 17

Correlation .2380 .2295 .1455 .1438 .1427 .1420 .1387 .1352 .1287 .1237 .1215 .1173 .1090 .1088 .1076

i 1

1 2

1
1 1

3
4

5 6
7 8 9 10 11 12 13 14 15

c)

Iteration Using Queries 34 and 16 Y = 0.1, < = 1.0, e = 50 S

d)

Iteration Using Query 34 y = 0.2, 6 = 0.5, e = 50

Retrieval Results for Query 16 Fig. 2

X-ll

•

Precision
A

A

• Curve of Curve of Curve of A Curve of

Fig. 2(a) Fig. 2(b) Fig. 2(c) Fig. 2(d)

O.I h
J_ _L X X

i

»

i

i

i

r

Recall

0

.05 .10 .15 .20 .25 .30 .35 .40 .45 .50

Recall-Precision Curve Fig. 2 (contd.)

X-12

Rank 1 2 3 4 5 6 7 8 9 10 11 i 12 13 14 15 a)

Document 41 R 100 90 R 111 11 45 110 127 104 192 71 159 42 R 76 133

Correlation .4762 .4280 .3859 .3151 .3123 .2896 .2750 .2688 .2637 .2610 .2601 .2576 .2572 .2481 .2480

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 . 15
•

Document 41 R 100 90 R 111 11 45 127 110 104 42 R 159 76 71 133 185
. J

Correlation .4784 .4312 .3799 .3177 .3095 .2910 .2737 .2699 .2662 .2620 .2582 .2560 .2506 .2503 .2461

SMART Standard Retrieval

b)

Space Modification Y = 0.1, 6 - 0.5, e = 50

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 c)

Document 41 R 100 90 R 111 11 45 127 42 R 104 110 76 159 133 185 39

Correlation .4685 .4317 .3615 .3192 .3014 .2901 .2761 .2685 .2672 .2646 .2618 .2580 .2516 .2484 .2453

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 d)

Document 41 R 100 111 90 R 42 R 45 127 76 11 39 104 188 156 185 159

Correlation .4450 .4183 .3152 .3085 .2895 .2731 .2708 .2678 .2643 .2641 .2610 .2492 .2492 .2489 .2478
i

Space Modification y = 0.2, 6 = 0.5, e = 50

Space Modification y = °-5/ 5 a °-2> £ = 50

Retrieval Results for Query 7 (Results after iteration on Queries 12, 15, 16, 17, 38 and 41) Fig. 3

X-13

•

Precision

A

• Curve Curve Curve A Curve

of of of of

Fig. Fig. Fig. Fig.

3(a) 3(b) 3(c) 3(d)

A
0.8 0.7 0.6
\
> V*. #,

0.5 0.4 0.3 0.2 0.1 L_ 1 0.1 0.2
L. 1 1 I—— * • .Recall
•

0.3

0.4

0.5

0.6

e)

Recall-Precision Curve Fig. 3 (contd.)

X-14

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 a)

Document 80 81 102 66 82 69 83 88 125 193 114 94 111 124 11 R R R R R R R

Correlation .5190 .5021 .4696 .4537 .4218 .4036 .3966 .3831 .3807 .3660 .3578 .3533 .3369 .3341 .3213

Rank 1 2 3 4 5 6 7 8 9 10

Document 80 81 66 102 69 82 125 83 114 94 88 193 111 124 11 R R R R R

Correlation .5009 .4889 .4661 .4460 .4192 .4073 .3935 .3828 .3703 .3611 .3589 .3443 .3404 .3361 .3296 1 1 1 1

1

1

1

I

u
12 13 14 15 b)

R R

SMART Standard Retrieval

Space Modification Using Queries 7, 15 and 17 Y = 0.1, < » 0.7, e - 50 5

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Document 80 81 66 102 69 82 125 83 114 94 88 193 111 124 11 R R R R R

Correlation .5091 .4964 .4661 .4505 .4193 .4104 .3935 .3883 .3704 .3611 .3606 .3503 .3404 .3362 .3296

R R

c)

Space Modification Using Queries 7, 15, and 17 y - 0.1, 6 = 0.9, e = 0

Retrieval Results for Query 15 Fig. 4

X-15

Precision A

— Curve of Fig. 4(a) • Curve of Fig. 4(b) • Curve of Fig. 4(c)

"a-®— a
0.4 0.3 0.2 h 0.1 [-

L
0.1

•*-*- Recall

0.2

0.3

0.4

0.5

0.6

d)

Recall-Precision Graph Fig. 4 (contd.)

X-16

i

'

-

•

•

•

•

•

•

i

Initial Search

Figure

Queries

Relevant _ Documents 10 127 128 129 130

New Search Following Space Modification Relevant Queries Documents 16 67 80 81 82 83 84 85 102 193 41 42 72 90 95

Change in Recall and Precision

2

34

Increase

3

12 15 16 17 38 41

46 48 50 52 54 67 81 83 85 87 90 92 94 102 120 192 194 41 72 81 83 85 87 90 93 95 109

47 49 51 53 55 80 82 84 86 88 91 93 95 109 178 193 195 42 80 82 84 86 88 91 94 102 193

7

Increase

4

7 15 17

15

80 81 82 83 84 85 86 87 88 102 109 193

Decrease

1

1

Summary of Results of Space Modification Fig. 5

X-17

for Fig, 4.

Following these space modifications,

searches were performed

using queries 16, 7, and 15 respectively. One result is immediately obvious. Whenever the collection is

modified using non-related queries (queries with different sets of relevant documents from those of the query now at hand), recall and precision increase. However, whenever the collection is first modified

using queries with relevant document sets related to those of the query being examined, recall and precision decrease. The latter result may

be explained by the fact that although the clustering of relevant documents is accomplished as desired, the centroid of the generated cluster moves away from the query (as in the example of Fig. 4 ) . Specifically, when a space modification is performed, there exists no a priori reason for expecting that the new document space should be such that the relevant documents are clustered around the present query, thus yielding improved precision and recall, since the scheme which controls the document space alteration is independent of the query. However, the relevant document cluster must be located at

points in the space which are close to the query being processed if increased precision and recall are to be achieved. The addition of query

modification to the system is therefore necessary to assure this closeness; the query will be moved towards the relevant document cluster, thus facilitating improved retrieval results. In order to verify the fact that the relevant documents are grouped together,by the space modification process, the document-document correlations of the original space and of the modified space are computed

X-18

as in the example of Figs. 6 and 7.

These results confirm the fact that

the related documents are indeed grouped more closely together than they were originally. For example, in Fig. 6, Query 16 retrieves only relevant

documents 83, 102, and 193. However, the intercorrelations among documents 80, 81, 83, 84, and 88 all increased markedly. While it is true that

documents 67, 85, and 102 are essentially unaffected by the modification process, 5 relevant non-retrieved documents were clustered as desired. Fig. 7 offers another example of similarly successful clustering. An additional experiment was conducted to demonstrate that the proposed system, when expanded to include query modification in addition to document space alteration, leads to the desired increase in precision and recall. Specifically, given the modified document space, relevance

feedback results are used to modify the query in a fashion similar to that used by [2]. An updated query, Q 1 , is determined from an original query, Q, using the following equation:
£R.

2' - Q +

ISRJ

(8)

where the R. are the relevant documents.
i

The denominator is used to normalize the changes in the modification procedure so that the query is not altered too radically. If this normali-

zation were not carried through, the incremented concepts would nullify the effects of any of the components not affected by the modification procedure. Fig. 8 demonstrates the above assertion. Fig. 8(a) represents the

original SMART retrieval with the original query; Fig. 8(b) represents retrieval results using the original query and the modified collection. The

X-19

m as

O CM CM

00 VO CM

r- vo as r-

^

oo LO LO vO CM VO r- o LO1 r^ VO CM 00 ' F s oo rH VO LO LO ^r oo rH CM o o ^r CM O rH CM CM CM CM 00 00 ^t1 ^ CM CM CM CM r^ LO 00 00

*

^
as

LO

o

CO LO ^r o rH VO o r^ o oo CM ^rjvo CM rH oo vo rH rH rH O M C M CM|vO VO|00 CM CM CM CM 00 00 00 CM CM CM

o
^

o

LO

as V0 rH V0 00 rH as o 00 O 00 00 ^r oo
00 00

00 00

CM

o

CM CM

r- CM CM LO 00 O VO CM LO rH oo CM as r^ as o 00 VO O rH oo oo as r^ rH 00 V0 CO KD ^r O CM VO LO CM ^ oo as o rVO LO 00 00 ^r LO CM CM
00 00 00 CM 00 CM

vO

oo oo

rH V0 VO CM vO O CO O0 00 vO VO LO oo r^ r> vo O CM ^ o V0 OS O oo 00 00 VO OS rH ^T CM oo oo ^r 00 00 CM ^F LO VO LO VO ^r LO rH CM VO CO 00 LO ^r vo ^r CM OS CO rH vO O CTi CM LO CO CO CO ^r CM Os OS rH VO LO CM CM CM 00 00 00 CM CM VO LO ^d1 LO CM rH oo ^ LO

*

*

*

0)

U
0)

3 OJ 0 -P -P
J^

Q

«^

*

CO

O ^r oo

rrj
d) rH 0)

>

VO CO

CM LO O r- CM vo VOIVO r- ^r CM CM 00 00

r^ VO \> * CM 00 00 rH to O rH OO O OS CT> CM CM oo r^ r- o CM CM
00 00 CM 00

K

&
0)

G 03 CL, 04 W 0 4J TJ 0) - -H rH IW -H II T 3 O ^O g rH • II JH

§ a) 3 o

— £ 0 -H -P rrj rH
C D

+J
aj

a
£
p

LO 00

CM [^ CM OS 00 00 LO LO ^r 00 CM 00 CM VO LO o o LO LO CM CM CM CM LO O

r- co ^

as ^r oo o

U U O

0 Q

o

o a) u tQ II § C
>P C

-H vO < D

CO

^ 00 Os KD r- CM O LO CM 00 ^r as V0 CM CM ^r r- oo 00 00 LO LO o o ^r LO
OS O0 OS ^ rH 00 rH 00 CM ^r LO LO LO LO

*

U O 4-1

Ul

C
-H -4J

o

-P 0 •H -P ^ -P 0 VOJ3 rH

vd g

rrj C D U O C -H

a)

tP -H Pn

m
00

CO CM

rrj
rH

vO vO

(D M
rH

> i <D -P 5 U H £ < rrj rrj D

•*.

1 f W -u t H &*
tp

CM CO

CM LO CM rvO CM OS LO O rH CM CM rH r- LO LO

*

O CJ -p
0)

c
B 3

tP G -H 0) 3

^r oo oo
00 O CM CM

o

o

o

Q I -P

o o

rH - H rrj G G tn - H -H tn C O -H * M U O O rrj XI 1 1

*d C D
0) VH

o
00

^ Os LO LO CM CM

a Q)
O

"^
0 Q

vQ

o o o

KD

O 00

CM CO

P0 CO

<tf
CO

LO CO

vO
CO

r-

CO

CO CO

CM

o

as o

oo as

ill

Document 8

8 1000

13 191* 212 1000

58 186 177 335 332 1000

59 233 219 318 326 456* 522 1000

60 000 014 279 274 504* 526 235* 304 1000

200 183 163 132 130 384 389 446 450 086 108 1000

13

58

59

60

200

Document-Document Correlations for Documents Relevant to Queries 4 and 5 (Space modification y = 0.1, 6 = 0.7; * Significant increase)

Fig. 7

X-21

| Rank ; l 2 3 4 5 6 7 8 9

Document j Correlation 165 162 164 58 163 60 110 150 92 167 127 R 198 10 R 185 128 R .6559 .5773 .4740 .4772 .3741 .3721 .3615 .3570 .3497 .3381 .3181 .3162 .3058 .2971 .2888

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 b)

Document 165 162 164 58 60 163 110 150 92 167 198 185 127 R 10 R 129 R

Correlation .6485 .5713 .4675 .4337 .3670 .3642 .3581 .3475 .3451 .3353 .3122 .3002 .2921 .2757 .2682

1 10 11
! 12 13 ! 14 i 15 a)

Standard SMART Retrieval R = 0.9364 P = 0.5960

Retrieval Using Modified Space and Fixed Query (y = 0.1, 6 = 1.0) R = 0.9887 P = 0.9044

Rank ; ; i 2 3 ! 4 : 5 6 i 7 8 9 10 11 12 13 14 15

Document 165 10 129 162 127 164 58 128 163 130 167 14 150 92 110

Correlation .6434 .6377 .5872 .5609 .5502 .5422 .5013 .4581 .4386 .3840 .3819 .3792 .3788 .3776 .3676

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 d)

Document 10 165 129 127 162 164 58 128 163 130 92 150 14 110 60 R R R

Correlation .6521 .6331 .6114 .5801 .5688 .5461 .5067 .4592 .4345 .4014 .3918 .3911 .3786 .3721 .3587

R R R

R R

R R

Retrieval Using Relevance Feedback (modified query, fixed space) R = 0.9867 P = 0.8621

Retrieval Using Modified Space and Modified Query (y = 0.1, 6 = 1.0) R = 0.9887 P = 0.9044

Retrieval Results for Various Combinations of Space and Query Modification Fig. 8

X-22

results given in Fig. 8Cb) are not as good as those in Fig, 8(a) since the document space has been reclustered, but not around the query vector. Fig. 8(c) represents the SMART retrieval with the modified query, whereas Fig. 8(d) represents retrieval results using the modified query and the modified collection. As expected, the results indicated in Fig. 8(d) are

better than those in Fig. 8(c), the standard method of relevance feedback (using only query modification). In analyzing the effects of the various chosen values for 6, y, and e, it appears that best results are obtained for small values of y and For 6 = 0.5, too many significant d.fs are generated,

large values of 6.

not all of which were actually important to the relevance of the pertinent documents. Since as many as 18 to 20 concepts were present with corresponding

d.'s of 1.0, large values of 6 such as 0.8, 0.9, 0.98, yielded the best retrieval results. With y = 0.1, retrieval results on successive iterations However, since the collections

were almost identical to those of SMART.

were being reclustered, impressive results could be obtained with query modification,. The documents were greatly altered when large values for y

were chosen; the corrections to the concept weights adjusted these weights too drastically to be of value. Concerning the modification to £, the initial weight of new concepts entered into a document concept vector, it appears as if the addition of e ^ 0 to relevant documents has some effect; however, the effect is not appreciable, unless, of course, e is set to some unreasonably large value.

X-23

4.

Results and Conclusions The results of these experiments indicate that by use of

the discrimination factor, d., to guide the redefinition of the document space, documents are reclustered, and are subsequently brought closer together. However, the reclustering does not necessarily This explains the result that

take place around the original query.

initial space modification without query modification is not necessarily as useful as an original SMART search would be. As soon as relevance

feedback is further employed to modify also the query, the relocation of the query leads, however, to an improved retrieval when the next search is accomplished. The next step to be taken is clearly the general

incorporation of document space and query .modification into a system such as SMART. An aspect of the study which has not been fully investigated to date is the practicality of a system which centers around the frequent updating of a large document collection. Ideally, of course, However,

the collection should be updated after each query is processed. this is certainly a very tedious process.

In the experimental study

of the 200 documents in the Cranfield collection, a full search on the 360/65 took about 15 seconds while updating the collection took about 10 seconds. By batching the queries in groups of three or four, these

processing times are reduced to 6 and 9 seconds for the searching and collection refinement procedures, respectively. It may be possible to

modify the collection while a full search is going on, thus reducing the processing times still further. That is, after computing the

cosine correlations, the document is modified before writing it out

X-24

again, thus eliminating one complete reading of the document collection per iteration. The appendix describes the programming system written to carry out this study.

References

]

S. R. Friedman, J. A. Maceyak, and S. F. Weiss, A Relevance Feedback System Based on Document Transformation, Scientific Report No. ISR-12 to the National Science Foundation, Section X, Department of Computer Science, Cornell University, June 1967. E. Ide, User Interaction with an Automated Information Retrieval System, Scientific Report No. ISR-12 to the National Science Foundation, Section VIII, Department of Computer Science, Cornell University, June 1967. 0. W. Riddle, T. Horwitz, and R. Dietz, Relevance Feedback in Information Retrieval Systems, Scientific Report No. ISR-11 to the National Science Foundation, Section VI, Department of Computer Science, Cornell University, June 1966. G. Salton, Class Notes in Information Storage and Retrieval.

]

]

]

X-26

Appendix The Programming System

The programming system used consists of the five subroutines MAIN, SEARCH, GET, SET and UPDATE, which serve the following functions:

A) 1.

MAIN Reads in the collections (documents and queries) and stores them on a disk.

2.

Calls SET and then SEARCH. SET - Changes the PSW in the 360 so that exponent underflow messages do not appear in the output listings.

B) 1. 2. 3.

SEARCH Reads in a card containing the next set of queries to "batch". Reads in the actual query vectors into core from the disk storage. Reads the document collection, and computes the cosine correlation for each document with all queries. The results of this search are sorted and printed. Calls GET to read into core the relevant retrieved documents. Calls UPDATE to compute the d.'s and update the collection. Reads in the next batch of queries, if any, and repeats the search procedure. If there are no more queries, then control passes back to MAIN, which terminates execution. GET - reads into core the needed documents or queries based upon the a priori relevance feedback.

4. 5. 6.

C) 1. 2.

UPDATE Computes and prints the discrimination factor d. for each concept. Updates the document collection.

X-27

The collection to be read in has been modified from the original SMART collection in order to be compatible with Cornell University's COOL system for the 360. Columns 1 to 72, only, are used on each card.

The format of the collection is as follows:

Card 1 - columns

1-16 17-20 21-24

Title of collection NCOL - Number of cards in collection NTDOC - Mumber of documents in collection

Card 2 - columns

1-8

'NO MORE1

Next follows NTDOC blocks of cards, one for each document. for each block is:

The format

Card 1 - columns

1-4 9-12 13-16

Document number Number of concept weight pairs Number of a priori relevant documents

Card 2 - Concept weight pairs, 9 per card, 4 columns for concept number, 4 columns for weight. Document vectors are nor-

malized to an Euclidean length of 1000. Last card - A priori relevance information, 4 columns each. (For the document collection, this card is blank).

The very last card in the set, immediately after the NTDOCth block, contains
f

END' in columns 1 to 4.

This is used as an error check

whenever the collections are read. The collections are organized so that 'A' format is used whenever the collections are read.

X-28

In order to run the program, the following data cards are used: Card 1 - Columns 1-10 Maximum concept number in collection (X10 format) Value of GAMMA (F10.5) Value of DELTA (F10.5) Value of Epsilon (F10.5) Number of retrieved documents to sort and print (110) Do not punch out updated collection at end of retrieval. Punch out collection (onto a Cornell Data Set) Title card

31-40 41-50 51-60 61-70

Card 2 - Column 1

0 1

Columns

2-72

Beginning with card 3 - Document collection, followed by query collection. The remaining data cards are the actual search cards, one card per iteration. Each card has the format:

Columns

1-4

Number of queries to batch process

5-8, 9-12, 13-16, ... Query numbers, in ascending order.

At most 17 queries can be batched at once. to the size of the arrays set up in SEARCH.

(This arbitrary number is due In order to batch more than

17 queries, the size of ARRAY and INDEXR must be increased). The printed output consists on the top retrieved documents (as determined by the number in columns 61 to 70 from the first data card), including the a priori knowledge of the relevant documents. This is After

followed by a listing of all of the query-document correlations.

listing the results for all of the queries which were batched at one time,

X-29

the d.'s are printed as the collection is updated. In addition to the basic retrieval programs, three independent utility programs exist. The first lists the document collection.

This routine is needed whenever a modified collection is punched onto cards and it is desired to see how the concept weights have been changed. A second program computes the document-document correlations, given a document collection as input. The third program modifies the query and computes the correlation of the modified query with the documents in the space given the original query and the a priori known relevant documents as input. This routine effectively performs relevance feedback. At the present time, these three routines are independent of the main retrieval system; however, it would be relatively easy to incorporate them as subroutines of SEARCH in order to generate an effective retrieval system, which would also include the standard relevance feedback process.