I. The Cornell Implementation of the SMART System

D. Williamson, R. Williamson, M. Lesk

Abstract

The systems organization of the SMART programs is discussed as implemented for operation in a batch processing mode on the IBM 360/65. Covered in particular are the basic input and text analysis routines, the document clustering programs, the search routines, and the feedback operations. Sample computer output is shown in each case to illustrate the operations.

1. Introduction

The SMART system is designed for the exploration, testing, and measurement of proposed algorithms for document retrieval. The system takes documents and search requests in English, performs a fully automatic content analysis of the texts, matches analyzed documents with analyzed search requests, and retrieves those stored items believed to be most similar to the queries. The request authors (users) can submit information to improve their queries (relevance feedback), and this information is used by several experimental procedures to improve search results. The time required to match large collections of documents to requests can be reduced by grouping these documents (clustering) and matching requests against a representative of the entire group. Finally, exhaustive evaluation procedures can be used to ascertain the effectiveness of various methods used in searching.

Several important criteria are incorporated in the implementation of the SMART system [1]. The requirement for mixing different processing methods, such as clustering, relevance feedback, and searching, implies that the programming system should be written in terms of many small blocks, in such a way that any one process would be synthesized by assembling several blocks into one unit. In this manner, not only can a process be carried out using many different combinations of methods, but a change in any part of the system does not require major alterations of the other parts of the system.
The fast processing speed necessary to process large collections is gained by making it possible to process several queries in parallel.

2. Basic System Organization

The SMART information retrieval process can be divided into five basic sections: the input of printed text, the grouping of documents for searching purposes (clustering), the selection of a group of documents to be searched, the searching of the document group, and the evaluation of the search.

The printed text specifying the queries and documents must be converted into a form more easily handled by a computer. For this purpose various automatic language analysis devices can be used which reduce each query and document to "concept" vector form.

To produce fast searching algorithms, documents can be grouped into classes of similar documents. The grouping (clustering) is done by placing documents containing similar concepts together into the same group; a representative central item is then constructed for each group.

The search of a document group (cluster) is done by first matching requests against clusters. Certain clusters are picked as most likely to contain documents of interest. These documents are then searched in the normal manner, one item at a time. After seeing some retrieved documents, the requestor can modify his request, either by physically changing it, or by using the requestor's relevance assessments to automatically modify the query.

Several measures of retrieval performance are computed to evaluate each search. The sign test, t test, and Wilcoxon rank sum test are also used to determine the significance of the evaluation measurements.

A) Input of Printed Text

The first section involves the reading of text (e.g., abstracts, queries) and the conversion of a given text into numeric concept vectors with weights. The conversion process may involve the use of suitable dictionaries, thesauri, and other language normalization aids. At present, a relatively simple PL/1 program is used to implement this section. A more flexible Fortran IV program is planned for later implementation, as described in report ISR-14 [2] and included in the system flowcharts, part 4 of the report.

The presently available text-handling program, LOOKUP, is a procedure which performs dictionary lookups on a large IBM 360-series computer. It accepts a dictionary, suffix list, and texts, and produces "concept vectors" for the texts. Words missing from the dictionary are also processed. The algorithm is essentially that of Sussenguth [3], although the tree structure storage format is not used. LOOKUP is designed primarily for ease of programming, and is coded entirely in PL/1.

The overall operation of LOOKUP is divided into three parts. First, the dictionary and suffix list are read into memory and sorted alphabetically, and the necessary initialization is performed. Second, text is read in, divided into words, and the words are looked up in the dictionary and suffix lists. Third, the concept numbers derived from the words in each document are sorted and condensed into a properly weighted vector. The vector can be printed and/or stored in machine-readable form.

The lookup program finds a match between an input word and a dictionary entry under the following conditions:

1) the word exactly matches a dictionary entry; or
2) it matches a dictionary entry with a final "e" dropped and a suffix beginning with a vowel added; or
3) it matches a dictionary entry plus a suffix; or
4) it matches a dictionary entry with a final "y" changed to "i" and a suffix added; or
5) it matches a dictionary entry with a final consonant doubled and a suffix added.

When several possible matches are found, the match involving the longest stem is preferred; within stems of the same length, preference is in numerical order as above.
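Sketched in Python (LOOKUP itself is coded in PL/1), the five rules and the longest-stem preference might look like this; the dictionary and suffix list contents are purely illustrative:

```python
# Illustrative sketch of LOOKUP's five matching rules; not the PL/1
# original.  `dictionary` is a set of stems, `suffixes` a suffix list.

VOWELS = set("aeiou")

def find_stem(word, dictionary, suffixes):
    """Return (stem, rule) for the preferred match, or None.

    The longest matching stem is preferred; among stems of equal
    length, the lowest-numbered rule wins.  Stems found through the
    suffix rules must be at least three letters long.
    """
    candidates = []
    if word in dictionary:                         # rule 1: exact match
        candidates.append((word, 1))

    def add(stem, rule):                           # enforce 3-letter stems
        if len(stem) >= 3 and stem in dictionary:
            candidates.append((stem, rule))

    for suf in suffixes:
        if not suf or not word.endswith(suf):
            continue
        base = word[: len(word) - len(suf)]
        if suf[0] in VOWELS:
            add(base + "e", 2)                     # rule 2: final "e" restored
        add(base, 3)                               # rule 3: stem + suffix
        if base.endswith("i"):
            add(base[:-1] + "y", 4)                # rule 4: "i" -> final "y"
        if len(base) >= 2 and base[-1] == base[-2]:
            add(base[:-1], 5)                      # rule 5: undoubled consonant
    if not candidates:
        return None
    return min(candidates, key=lambda c: (-len(c[0]), c[1]))
```

With "cop", "cope", and "copy" in the dictionary, this sketch reproduces the derivations given below for "cops", "copes", "coping", "copies", and "copper".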
Thus, if "cop", "cope", and "copy" are all stems in the dictionary, and all normal English suffixes are included in the suffix list, "cops" is found from "cop" under rule 3; "copes" or "coping" is found from "cope" under rule 2; "copying" from "copy" under rule 3; "copies" from "copy" under rule 4; and "copper" from "cop" under rule 5. Other morphological features of English are not recognized; such word pairs as "mouse" and "mice", "sing" and "sung", "fight" and "fought", or "court-martial" and "courts-martial" must be entered explicitly in the dictionary if both members are to be recognized. Special rules exist which specify that all stems must be at least three letters long (to avoid, for example, finding "wing" from "we" under rule 2 or "inning" from "in" under rule 5); furthermore, all words are truncated at 24 characters.

The program can distinguish titles from the body of the text, if asked; and it may either split the weight of an ambiguous word among its concept numbers, or weight all concept occurrences equally. The suffix list may be omitted from the lookup, in which case only words that exactly match a dictionary entry can be found; and the programmer may choose whether hyphenated words are to be considered as a unit or as separate words. As in the previous SMART implementations, concept numbers of zero or concept numbers of 32000 or more are considered to be nonsignificant and are dropped from the vector.

Fig. 1 shows a typical output of LOOKUP. First the title is given, and then the text of the document (or query in this case). The resulting numeric concept vector is next printed, consisting of pairs of concept numbers followed by the respective concept weights (for example, concept 927 with weight 12, 2574 with weight 12, etc.). Concepts are listed in the vector in increasing numeric order.

[Fig. 1: Typical output of the LOOKUP program: title, document text, and the resulting concept vector.]

B) Document Clustering for Search Purposes

At present two clustering algorithms are in operation at Cornell: CLUSTR, which uses Rocchio's clustering algorithm [4], and DCLSTR, a variation of Doyle's clustering algorithm [5].

Rocchio's clustering algorithm is based on the following methodology: an unclustered document is selected as a possible cluster center. Then all of the other unclustered documents are correlated with it, and the document is subjected to a density test to see if a cluster should be formed around it. The density test specifies that at least N_1 documents should have correlations higher than a specified parameter rho_1 with the document in question, and that at least N_2 documents should have correlations higher than rho_2 (rho_2 is generally larger than rho_1). This test ensures that documents on the edge of large groups do not become cluster centers. If the document passes
the density test, thus becoming a cluster center, a cutoff correlation rho_min is determined from the cluster size limits and the distribution of correlation values. The cutoff correlation becomes rho_1 if fewer documents than the minimum cluster size (M_1) have correlations above rho_1. If more such documents exist, the cutoff correlation is chosen at the greatest correlation difference between adjacent documents, checking from the minimum cluster size M_1 up to the maximum cluster size M_2. A classification vector is then formed by taking the centroid of all the document vectors having correlations above rho_min. This centroid vector is matched against the entire collection, and the cutoff parameters for cluster size are recalculated to create an altered cluster. As a result of this process, some documents may appear in more than one cluster, and some which were in a cluster when the centroid was originally formed may not remain in any cluster. These documents, as well as those which failed the density test, are termed "loose", and those within the cluster are termed "clustered". This entire procedure is repeated with all unclustered documents, the first pass terminating when all items are either clustered or loose.

Figs. 2, 3, and 4 illustrate the formation of a cluster. Document 2 is first correlated with all previously unclustered documents in the collection (9 of the 82 documents in the collection had previously been clustered around document 1).
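The density test and the gap-based cutoff selection can be sketched as follows (a Python illustration, not the actual CLUSTR code; `corrs` is assumed to hold the candidate document's correlations with the unclustered documents, sorted in decreasing order, and the parameter names N_1, rho_1, N_2, rho_2, M_1, M_2 follow the text):

```python
# Illustrative sketch of Rocchio's density test and cutoff selection.
# `corrs` is the decreasing-order list of correlations of the candidate
# cluster center with every unclustered document.

def passes_density_test(corrs, n1, rho1, n2, rho2):
    """At least n1 correlations must exceed rho1 and n2 must exceed rho2."""
    return (sum(c > rho1 for c in corrs) >= n1 and
            sum(c > rho2 for c in corrs) >= n2)

def cutoff_correlation(corrs, m1, m2, rho1):
    """Choose rho_min at the greatest gap between adjacent ranked
    correlations, scanning from rank m1 (minimum cluster size) up to
    rank m2 (maximum cluster size); fall back to rho1 when fewer than
    m1 documents exceed rho1."""
    if sum(c > rho1 for c in corrs) < m1:
        return rho1
    best_gap, rho_min = -1.0, corrs[m1 - 1]
    for rank in range(m1, min(m2, len(corrs))):
        gap = corrs[rank - 1] - corrs[rank]   # gap between rank and rank+1
        if gap > best_gap:
            best_gap, rho_min = gap, corrs[rank - 1]
    return rho_min
```

On a ranked list whose largest gap after rank 5 falls between ranks 7 and 8, `cutoff_correlation` returns the correlation of the rank-7 document, mirroring the worked example that follows.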
The correlations are ranked, and the ranks, document numbers, and correlation coefficients are listed in Fig. 2.

[Fig. 2: The Testing of Document 2 as a Possible Cluster Center: the ranked correlations of document 2 with each previously unclustered document. Document 2 passed the density test, so the cutoff will be checked.]

[Fig. 3: Correlations of the centroid with the previously unclustered documents.]

[Fig. 4: The centroid of cluster 2, with its concepts and weights, and the documents included in the cluster.]

In the example, at least 10 documents (N_1) must have a correlation greater than 0.15 (rho_1), and at least 5 documents (N_2) must have a correlation greater than 0.25 (rho_2). The correlation of document 2 is larger than 0.15 for 19 other documents, and for 5 other documents the correlation exceeds 0.25. Document 2 therefore passes the density test.
M_1 in this example is 5, and rho_min is therefore calculated by finding the greatest correlation difference between adjacent documents, starting with the document of rank 5 (at least M_1 documents must be included) and checking differences up to M_2 documents (in this case 15 documents). The largest gap occurs between ranks 7 and 8; rho_min is therefore taken to be 0.2475.

The classification vector (called the centroid) is formed by merging the document vectors of documents having correlations above rho_min (0.2475). The centroid, composed of concepts and weights, is shown in Fig. 4. This centroid is then correlated with all previously unclustered documents (Fig. 3). A second cutoff correlation rho_min is calculated to determine which documents belong in cluster 2. Here the greatest correlation difference (starting at M_1 and checking until M_2) occurs between the documents ranked 11 and 12. Therefore rho_min becomes 0.3859, and the top 11 documents are included in cluster 2. These documents are listed as the "11 Relevant" in Fig. 4.

DCLSTR uses a variation of Doyle's algorithm [5]. The following description of the algorithm covers the main points. Assume that the document set is arbitrarily partitioned into m clusters, where S_j is the set of documents in cluster j. Associated with each set S_j is a corresponding concept vector C_j and frequency vector F_j. The concept vector consists of all the concepts occurring in the documents of S_j, and the frequency vector specifies the number of documents in S_j in which each concept occurs.

Every concept in C_j is assigned a rank according to its frequency; i.e., concepts with the highest frequency have a rank of 1, concepts with the next highest frequency receive a rank of 2, etc. Given an integer b (base value), every concept in C_j is assigned a rank value equal to the base value minus the rank of that concept. The vector of rank values is called the profile P_j of the set S_j. Fig.
5 illustrates the concept and frequency vectors, and the corresponding profiles, for a sample document collection.

Starting from a partition of the document set into m clusters, the profiles are generated as described. Every document d_i in the document space is now scored against each of the m profiles by a scoring function g, where g(d_i, P_j) equals the sum of the rank values of all the concepts from C_j which occur in d_i. Fig. 5 shows the results of scoring the documents in the sample collection against the profiles.

A new partition of the document set into m+1 clusters is then made by the following formula:

    S'_j = { d_i | g(d_i, P_j) > T_i },    1 <= j <= m

where

    T_i = H_i - [a(H_i - T)]    if H_i > T
    T_i = T                     otherwise

    H_i = max over j of g(d_i, P_j),    0 <= a <= 1,

and T is the given cutoff value. Those documents which do not fall into any of the m clusters S'_j are called loose documents, and they are assigned to a special class L, the (m+1)st cluster.

[Fig. 5: Construction of Profiles from Documents (base value = 6): (a) documents; (b) initial clusters, profiles, and frequencies. One Iteration of Doyle's Classification Algorithm (cutoff = 10): the profile of highest score for each document, and the resulting clusters.]

The process is now repeated after replacing S_j by S'_j and P_j by P'_j. The iteration continues until S'_j = S_j (actually, the termination condition is S'*_j = S*_j, where S*_j is the subset of S_j consisting of all those documents that score highest against profile P_j).
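One iteration of this procedure can be sketched as follows (a Python illustration, not the DCLSTR code itself; documents are represented as lists of concept numbers, concepts tied in frequency are assumed to share a rank, and the threshold comparison is taken as >= so that the top-scoring cluster always qualifies):

```python
from collections import Counter

# Illustrative sketch of one Doyle iteration: build profiles from the
# current partition, score every document against every profile, and
# repartition using the per-document threshold T_i.

def make_profile(cluster_docs, base):
    """Rank concepts by document frequency (ties share a rank) and
    assign each the rank value base - rank."""
    freq = Counter(c for doc in cluster_docs for c in set(doc))
    ranks, rank, last_f = {}, 0, None
    for concept, f in sorted(freq.items(), key=lambda kv: -kv[1]):
        if f != last_f:
            rank, last_f = rank + 1, f
        ranks[concept] = rank
    return {c: base - r for c, r in ranks.items()}

def score(doc, profile):
    """g(d_i, P_j): sum of the rank values of the document's concepts."""
    return sum(profile.get(c, 0) for c in set(doc))

def repartition(docs, profiles, a, T):
    """Place each document into every cluster whose score reaches the
    per-document threshold T_i; unplaced documents become loose."""
    clusters = [[] for _ in profiles]
    loose = []                                     # the special class L
    for doc in docs:
        scores = [score(doc, p) for p in profiles]
        H = max(scores)
        T_i = H - a * (H - T) if H > T else T
        placed = False
        for j, s in enumerate(scores):
            if s >= T_i:
                clusters[j].append(doc)
                placed = True
        if not placed:
            loose.append(doc)
    return clusters, loose
```

Iterating `make_profile` and `repartition` until the partition stops changing reproduces the loop described in the text.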
Basically, this algorithm matches documents to existing clusters by computing a document-cluster score for each document with respect to each cluster, and placing a document into those clusters for which a sufficiently high score is obtained. The clusters are then updated to include the new documents. In each iteration all the documents are correlated with all the clusters, and the clusters are updated until further updating does not alter the group of documents in each cluster. This updating is shown in list form in Figs. 6 and 7. The 12 profiles (clusters) of Fig. 6 are matched against the documents, and updated to become the profiles of Fig. 7.

It should be noted that the document clustering process can be extended to the clustering of clusters. That is, if one of the two clustering algorithms generates m groups of documents, these m groups could be grouped together, as if they were documents, into n clusters, where 1 <= n < m. These n clusters could then be grouped together, and so on, until a hierarchical cluster tree is formed as shown in Fig. 8. At present no routines for automatically constructing such multi-level cluster trees exist in the SMART system, although such an algorithm is planned for implementation in the near future. Both CLUSTR and DCLSTR generate the first level of the cluster trees, thus representing special cases of more general tree construction routines.

C) The Selection of Documents to be Searched

The search process consists of four steps. First a search query is
defined, either using the author's original query or a modification of the original query, in numeric concept vector form. One important modification consists in using documents judged relevant by the author to modify the original query vector. This process is known as relevance feedback and is discussed in part D of this section.

[Fig. 6: The 12 profiles (clusters) before updating.]

[Fig. 7: Updated Profiles (Clusters): the documents contained in each of the 12 profiles.]

[Fig. 8: A Hypothetical Cluster Tree: the root, centroids, and documents (d).]

Once a search query is defined, the set of documents to be correlated with the query is selected. SMART provides two options: either a full search or a tree search may be made. In a full search every document in
the retrieval base is correlated with the query. In this case the selection of the documents to be searched is trivial: all documents in the collection are searched.

In the tree search [6], a set of documents is selected by a cyclic process, using a tree such as that pictured in Fig. 9. At any one time, a set of active nodes exists in the tree; initially this is the set of roots (the highest level of clusters). Each node in the active set is compared with the search query. The "goodness" of each node is defined from the relatedness of the query to the node, and from other information about the structure of the tree; the nodes of the active set are then ordered by this goodness value. A subset of active nodes is selected as being most promising. The corresponding nodes are deleted from the active set, and the sons of these nodes (if centroids) are correlated with the query and become part of the active set. Those sons which represent documents are then entered onto a list of documents to be used in subsequent correlations with the query. The active set is cyclically reordered and another group of nodes is selected to have its sons examined, until some desired number of documents are located. The process used to obtain a list of specific documents to be directly compared to a query is represented in Fig. 9.

[Fig. 9: Searching a Hypothetical Tree: the tree at the start and after one, two, and three cycles. The goodness value of each centroid in the active set is indicated inside that node; a centroid selected for expansion is indicated by a doubled border; selected documents are marked.]

The listing reproduced in Fig. 10 shows an example of input to the cluster searching routine.
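The cyclic expansion of the active set can be sketched as follows (a Python illustration; the node layout as nested dicts is an assumption, and the plain cosine correlation stands in for the full "goodness" computation described later in this section):

```python
import heapq
import math

# Illustrative sketch of the cyclic tree search.  Centroid nodes are
# dicts with a "vector" (concept -> weight) and "sons"; document nodes
# carry a "doc" id.  Goodness here is simply the query-centroid cosine.

def cosine(u, v):
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def tree_search(query, roots, wanted, expand_per_cycle=1):
    """Cyclically expand the best active centroids until `wanted`
    documents have been located; returns the located document ids."""
    active = []                   # max-heap of (-goodness, tiebreak, node)
    counter = 0
    for node in roots:
        heapq.heappush(active, (-cosine(query, node["vector"]), counter, node))
        counter += 1
    located = []
    while active and len(located) < wanted:
        for _ in range(expand_per_cycle):        # the selected subset
            if not active:
                break
            _, _, node = heapq.heappop(active)   # best node leaves the set
            for son in node["sons"]:
                if "doc" in son:                 # document son: collect it
                    located.append(son["doc"])
                else:                            # centroid son: activate it
                    heapq.heappush(
                        active,
                        (-cosine(query, son["vector"]), counter, son))
                    counter += 1
    return located
```

The heap keeps the active set ordered by goodness, so each cycle pops the most promising centroids, adds their centroid sons back to the active set, and accumulates their document sons for later correlation with the query.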
The first iteration (Iteration 0) uses a full search instead of a tree search. The second iteration (Iteration 1) represents a tree search on the cluster collection "CENTROID NO MORE" using the "COSINE" correlation. The desired number of documents to be selected for correlation with the query is given by "WANTED", where "WANTED" is defined as:

    WANTED = "CORDOC" + "TIMALL" x "ALLOF"
           + "TIMREL" x (the number of relevant documents not yet retrieved)
           + "TIMNMR" x "NOMOR"

where

    CORDOC    at least "CORDOC" documents will be correlated in this
              iteration;
    TIMALL    "TIMALL" times "ALLOF" (for this iteration) documents are
              additionally correlated in this iteration;
    TIMREL    "TIMREL" times the number of relevant documents not yet
              retrieved are additionally correlated in this iteration;
    TIMNMR    "TIMNMR" times "NOMOR" (for this iteration) documents are
              additionally correlated in this iteration.

"ALLOF" and "NOMOR" are user-supplied constants indicating how many documents are used in relevance feedback. Therefore the second and fourth terms of the parameter "WANTED" are constants, like "CORDOC", for a given iteration.

[Fig. 10: Input parameters to the cluster searching routine.]

These constants are expressed by three parameters (rather than
The parameter "TIMREL" allows the number of documents searched to be related to the number of relevant documents previously not found. The "goodness" of each node (the parameter value used to rank the nodes) may also be controlled by the user through 17 parameters as follows. "PCN" "GOODNESS" = "COEF" + "MCN" x "CROWN" "PFCFCN" "PNCFCN" + "MCFCN" x "COEF" r r ^ r u i N x "CROWN" r m r u i N + n PLV n + "MLV" + "LEVEL" "MCNLV" x "CROWN" n P N C N L V ? f x t l p p p p ] " ytf rrL IijV "LEVEL"™^' tfpyppT ytt + "MCFLV" x "COEF" " x "LEVEL" + "MALL" x "COEF"" P A L L C F " x "CROWN"" P A L L C N " x "LEVEL"" P A L L L V " where "COEF" is the correlation value (usually cosine) between the node and the query, "CROWN" is the number of nodes that are the sons of the node, and "LEVEL" is the level of the node. For example, the node in Fig. 9 wi'ih a "goodness" of 0.9, has a "CROWN" of 11 and a "LEVEL" of 3. It should be noted that the formula for "GOODNESS" contains many combinations of "CROWN" and "LEVEL", making the formula extremely flexible for experimental purposes. It is expected that most users will use only two or three terms, most parameters in "GOODNESS" being usually set to zero. The size of the subset of active nodes to be expanded (after all active nodes are ranked by "goodness") is determined by additional parameters specified by the user, as printed in the listing (Fig. 10) under "SELECTION" and "REJECTION". 
    MINNOD,   At least "MINNOD" nodes and not more than "MAXNOD" nodes
    MAXNOD    are to be expanded for this iteration;
    GAP       If there exist two nodes between "MINNOD" and "MAXNOD"
              which have a goodness difference greater than "GAP", all
              nodes above that gap are expanded;
    EPSLON    Any nodes within "EPSLON" of the last node selected for
              expansion are also to be expanded;
    MNGOOD    Any node with a "GOODNESS" of less than "MNGOOD" is not
              to be retained for expansion;
    MNCORR    Any node with a correlation less than "MNCORR" with the
              query is not retained for expansion;
    PERCOL    Only nodes whose combined "CROWN" is greater than "PERCOL"
              percent of the size of the collection being searched need
              be retained for expansion;
    TIMWAN    Only nodes whose combined "CROWN" is greater than "TIMWAN"
              times the number of documents to be correlated are
              retained for expansion.

The selection of the documents using the parameters from Fig. 10 is shown in Figs. 11, 12, and 13. The queries are processed as a batch, and queries 31, 32, 33, and 34 are shown as examples. The queries are first matched against the "roots" of the centroid tree, consisting of centroids 1, 2, and 3. The results of the matching and other useful statistics are shown in Fig. 11. The query number, iteration number, number of documents wanted and found, centroid used to match, "goodness" of the matching, statistics of the centroid, and the cosine correlation are given for each query match against all the roots.

The "REJECTION" parameters are used here to eliminate centroids before any ranking is done on "goodness". A "MNGOOD" of 0.10 causes centroid 2 to be dropped from the active set of query 33, and centroid 3 to be dropped
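The text-analysis step described in the Introduction reduces each document and query to a weighted "concept" vector by means of dictionary and thesaurus lookup. A minimal sketch of that reduction, assuming a small hypothetical thesaurus (SMART's actual dictionaries are not reproduced here):

```python
# Illustrative sketch only: SMART's real dictionaries and suffix rules are not
# available here. A tiny hand-made "thesaurus" maps word stems to numeric
# concept classes; a text becomes a weighted concept vector by counting the
# concept classes of its words.

from collections import Counter

# Hypothetical thesaurus: word stem -> concept number (invented for this sketch)
THESAURUS = {
    "retriev": 101, "search": 101,     # retrieval-related terms share a concept
    "document": 205, "abstract": 205,
    "feedback": 310,
}

def to_concept_vector(text):
    """Return {concept: weight} for words whose stems appear in the thesaurus."""
    vector = Counter()
    for word in text.lower().split():
        for stem, concept in THESAURUS.items():
            if word.startswith(stem):
                vector[concept] += 1
                break
    return dict(vector)

print(to_concept_vector("Feedback methods for document retrieval"))
```

Words not covered by the thesaurus are simply dropped, which roughly mirrors the language-normalization role the text assigns to the dictionaries.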
[Fig. 18: sample search output listing]

The last section of Fig. 18 contains various statistics for the run. These are defined as follows:

DOC. CORR      The total number of document-query correlations performed in the given iteration.

CENT. CORR     The total number of centroid-query correlations performed in the given iteration.

DROP DOC       The number of documents with a query-document correlation of less than "MIN CORR".

CORR. RANK     The number of documents with a query-document correlation of greater than or equal to "MIN CORR".

OLD. DOC       The total number of documents previously seen by the user.

OLD RELDOC     The total number of relevant documents previously seen by the user.

NEW DOC        The total number of documents (relevant and nonrelevant) shown to the user in this iteration.

POS. FEED      The number of items in the definition of the query with a positive multiplier for feedback.

NEG FEED       The number of items in the definition of the query with a negative multiplier for feedback.

QUERY CORR     The correlation of the query used in the present iteration with the original user query.

REC. CEIL      The recall ceiling seen by the user.

The listing of the retrieved relevant documents completes the first iteration of the search. At this point the user makes relevance judgments or, alternatively, prejudged relevance decisions are registered, and a new search query is constructed using information about the retrieved documents. The user-supplied instructions specifying what information is to be used, and how the new query is to be constructed, are taken from the input parameters (shown in Fig. 16 in the second two lines under the options for SEARCH 1). The definitions of the parameters are as follows:

POS MULT       All weights of the relevant documents used in feedback are multiplied by this number.

NEG MULT       All weights of the nonrelevant documents used in feedback are multiplied by this number. To signify that negative feedback is not desired, "NEG MULT" is blank or zero.

POS RANK CUT   All relevant items with iteration ranks above "POS RANK CUT" according to the ordering of the previous iteration are used in defining the new query.
NEG RANK CUT   All nonrelevant items with iteration ranks above "NEG RANK CUT" according to the ordering of the previous iteration are used in defining the new query.

POS CORR CUT   All relevant items with a correlation above this value are also used. (This value must include a decimal point.) If "POS CORR CUT" is zero or blank, no relevant are selected due to this parameter.

NEG CORR CUT   All nonrelevant items with a correlation above this value are also used.

POS ATLEST     At least "POS ATLEST" relevant will be fed back (if they exist, i.e., if more remain to be found).

NEG ATLEST     At least "NEG ATLEST" nonrelevant will be fed back.

POS NOMOR      However, no more than "POS NOMOR" items will be searched to provide the "POS ATLEST" relevant documents.

NEG NOMOR      However, not more than "NEG NOMOR" nonrelevant will be used. Note that only documents scanned in an attempt to locate relevant documents for positive feedback are used in attempting to find nonrelevant for negative feedback.

UNLESS         Negative feedback is done except when "UNLESS" relevant documents are found. If "UNLESS" relevant are found, no negative feedback at all is done. To signify that no negative feedback is desired, "NEG MULT" should contain blanks or a zero. Should "UNLESS" be left blank or set to zero, negative feedback is attempted regardless of the number of relevant actually used in positive feedback.

STOPALL        "STOPALL" is set to "YES" if the user wishes to stop considering documents for feedback once all the relevant documents have been found. If set to "NO", documents will be considered until the specifications of the other feedback parameters have been satisfied. The default is "NO".

PREC CUTOFF    If the precision after "POS RANK CUT" documents is over "PREC CUTOFF", and if the precision after more items are judged drops below "PREC CUTOFF", the judging of documents ceases.
POEFIN         "SILENT" if search queries are not to be printed; "STANDARD" if search queries are to be printed; "DETAILS" if details of the search query definition process are to be printed (used only for debugging).

Using Iteration 2 as an example, the new search query Q_{i+1} is defined by the following equation:

    Q_{i+1} = (1)\,Q_i + (1)\sum_{j=1}^{n_r} (r_j)_i - (1)\sum_{j=1}^{n_s} (s_j)_i

where (r_j)_i designates the concepts and weights of relevant document r_j; (s_j)_i designates the concepts and weights of nonrelevant document s_j; Q_i is the previous query (for iteration i); and n_r and n_s are defined by the number of relevant documents retrieved and the number of nonrelevant documents retrieved, respectively.

In Iteration 2, n_r <= 5; therefore, only the top five documents are retrieved, and not all of them will be relevant (in most cases). If at least two relevant documents are retrieved among the top five documents, no negative feedback will be done (n_s = 0). If fewer than two relevant documents are found, any nonrelevant retrieved among the top two documents will be used for feedback (n_s <= 2). This condition is stipulated by an "UNLESS" of 2 and a "NEG RANK CUT" of 2.

The newly defined search queries are shown in Fig. 19. For query 1, one relevant and four nonrelevant documents are retrieved in the top 5. The relevant document (17) and the only nonrelevant in the top two retrieved (69) are used to construct the new query. Query 2 retrieves 1 relevant, but the top 2 retrieved are both nonrelevant, so both (27 and 33) are used, together with the one relevant found during feedback. Query 3 finds 3 relevant in the top five retrieved; hence no negative feedback is used. The new query vectors (Fig. 20) are used for searching, and the results are shown in Fig. 18, second iteration.

[Figs. 19 and 20: new feedback query listings]
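The unit-multiplier feedback equation above can be sketched on concept vectors represented as dictionaries. The document vectors below are hypothetical, not the actual vectors of Fig. 19:

```python
# Sketch of the feedback equation: Q_{i+1} = Q_i + sum of relevant vectors
# - sum of nonrelevant vectors, with the (1) multipliers generalized to the
# POS MULT / NEG MULT parameters. All vectors here are invented examples.

from collections import Counter

def feedback(query, relevant, nonrelevant, pos_mult=1, neg_mult=1):
    """Apply one round of relevance feedback to a {concept: weight} vector."""
    new_q = Counter(query)
    for r in relevant:                      # add weighted relevant vectors
        for concept, weight in r.items():
            new_q[concept] += pos_mult * weight
    for s in nonrelevant:                   # subtract weighted nonrelevant vectors
        for concept, weight in s.items():
            new_q[concept] -= neg_mult * weight
    return {c: w for c, w in new_q.items() if w != 0}

q0 = {101: 12, 205: 4}                      # hypothetical original query
doc_rel = {101: 6, 310: 2}                  # hypothetical relevant document
doc_nonrel = {205: 4, 400: 1}               # hypothetical nonrelevant document
print(feedback(q0, [doc_rel], [doc_nonrel]))
```

Note that negative feedback can leave a concept with a negative weight; whether such concepts are kept or dropped is a design choice not settled by this sketch.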
E) Search Evaluation

Several different evaluation measures are used in the SMART system, all based on the concepts of recall and precision. The definitions of these measures are the following:

    Recall = a/b        Precision = a/c

where

    a = the number of relevant documents retrieved,
    b = the number of relevant documents in the collection,
    c = the number of documents retrieved.

These measures are usually computed at a specified point during retrieval, usually either after a given number of documents have been retrieved, or after a given recall has been obtained.

Two types of averaging graphs, and four types of overall recall and precision averages, are generated by the SMART system and listed in Figs. 21, 22, 24, and 25. Fig. 21 shows one type of graph and all four overall averages. At the top of the listing the runs being evaluated are identified (in this case a full search run (run 0) and a centroid search run (run 1)). Below are listed the recall levels being used, and the precision achieved at each recall level. The number of queries used in the averaging at each point is also given. For example, at recall level 0.10, run 0 shows a precision of 0.4948, but for only 2 queries a relevant document had been retrieved at that recall level.

[Figs. 21 and 22: recall-level averages and recall-level graph]

For example, at a recall of 0.10 (10 percent of the relevant documents retrieved) run 0 shows a precision of 0.4948, and run 1 a precision of 0.4803. These precision values are the averages of the precision, at a given recall level, for all the queries searched.
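The recall and precision definitions above can be sketched directly; the ranked output and relevance judgments below are hypothetical:

```python
# Sketch of the definitions: a = relevant retrieved, b = relevant in the
# collection, c = documents retrieved. The ranked list and relevant set are
# invented for illustration.

def recall_precision(ranked_ids, relevant_ids, cutoff):
    """Recall and precision after `cutoff` documents have been retrieved."""
    retrieved = ranked_ids[:cutoff]
    a = sum(1 for doc in retrieved if doc in relevant_ids)
    b = len(relevant_ids)
    c = len(retrieved)
    return a / b, a / c

ranked = [17, 69, 27, 33, 52]     # hypothetical ranked search output
relevant = {17, 33, 88, 90}       # hypothetical relevant set (b = 4)
print(recall_precision(ranked, relevant, cutoff=5))
```

Evaluating this at several cutoffs gives the per-query values that the averaging graphs summarize over all queries.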
It should be noted that interpolation methods are needed to produce the averages, since all queries do not possess an exact precision value at each given recall level. The graph of Fig. 23 shows the necessary interpolation for a hypothetical query with four relevant items. The relevant documents are assumed to be retrieved with ranks of 4, 6, 12, and 20. Thus, at 25 percent recall, the precision is 0.25; at 50 percent recall, the precision is 0.33, and so on. However, these values actually correspond to the highest possible precision points, since they are calculated just after a relevant document is retrieved. In this example, after 3 documents are retrieved, the precision is 0; after 5 documents, the precision is 0.20, and so on. This range of precision for each recall level is indicated by the top and bottom points in Fig. 23 at 25%, 50%, 75%, and 100% recall. The solid sawtooth line connecting these points is not used for interpolation; it is intended to indicate the drop in precision between the actual recall levels for this query, as more nonrelevant documents are retrieved.

The interpolation method actually used by the SMART system is based on the dashed lines shown in Fig. 23, where a horizontal line is led leftward from each peak point of precision, up to a point where a higher point of precision is encountered. This new curve (the dashed line in Fig. 23) does not lie above the sawtooth curve at all points. When the precision drops from one recall level actually achieved to the next, an immediate drop in precision after the first point to the level of the next point is indicated.

[Fig. 23: An Illustration of the Interpolation Method Used by the "Neo-Cleverdon" Recall-Precision Averages]

For example, in Fig. 23 the precision value at 0.50 recall is 0.33; but at 0.55 recall, the interpolated value used for the new averages is 0.25 precision. When the precision rises from one recall level to the next, however, the first precision point actually achieved is ignored for purposes of interpolation. The achieved precision of 0.25 at 0.25 recall in the example of Fig. 23 is ignored, and an interpolated precision of 0.33 is used for the averages for all recall levels from 0 to 0.50.

The second kind of average graph also generated is shown in Figs. 24 and 25. In this graph, the recall and precision are recorded and averaged after the retrieval of a given number of documents. For example, after one document has been retrieved in run 0, the average recall (over all the queries) is 0.0903, and the average precision is 0.3714. The recall and precision are averaged for 24 different cutoff points, and for 6 different percentage points, such as after 10 percent of the collection has been retrieved, etc. Three other statistics (besides the recall and precision) are measured and listed for each cutoff point. The first (NR) is the number of relevant documents retrieved at the given cutoff, and the second (CNR) is the cumulative number of relevant documents retrieved by this point. Also listed is the number of queries used to obtain the average at each point. These values are included to aid in the proper evaluation of runs, since the document-level averages are not plotted at equal levels of recall for each query (as are the recall-level graphs).

[Figs. 24 and 25: Document-Level Averages]

The final part of the evaluation process consists of tests of the significance of the differences between runs. Three basic statistical tests, the sign test, the T-test, and the Wilcoxon signed rank test, are calculated for each pair of search runs. All three statistical tests indicate whether a given difference in two averages is likely to have occurred by chance.

A one-sided test is designed to compare a supposedly better sample B with a given standard sample A. Specifically, one proposes two hypotheses H0 and H1. H0 states that the two samples A and B are produced by the same distribution; H1 states that sample B is statistically better than sample A. H1 is accepted if it is unlikely, under H0, that a difference between samples as great as, or greater than, that observed would occur by chance.

A two-sided test similarly compares two samples under the same H0, but with the alternate hypothesis H1 being that samples A and B are from different distributions. Here again H1 is accepted if the probability, under H0, is low that a difference between the two samples as great as, or greater than, that observed would occur by chance.

The T-test assumes that the differences d_i between the two measures, a_i and b_i, are distributed normally. Explicitly, it is assumed that d has mean \bar{d} and standard deviation \sigma_d. Note that \bar{d} and \sigma_d are computable for any distribution, not only the normal one. In particular, it is known that many sets of differences are not normally distributed. (For further discussion of the T-test and sign test, see reference [7], page 12, also [3], chapter 8.)

The sign test assumes that a result is equally likely to favor either sample A or sample B. Thus, it measures the probability of a more extreme distribution favoring B, or favoring either A or B.

The Wilcoxon signed rank test postulates that a greater difference between paired samples is more significant, but only as the numbers affect the ranking of the differences.
For example, differences of -1, 2, -3, 4, and 20 are equivalent to differences of -1, 2, -3, 4, and 5, since only the rank of the ordered differences favoring a sample is important (not the actual values of the differences). The Wilcoxon test assumes that the two samples come from the same family of distributions, i.e., either two normal distributions, or two binomials, etc.

The three tests are performed for eleven points of the recall-level averages, and for the four overall measures of recall and precision of the document-level averages; in addition, the tests are also performed for the 17 cost statistics. The three listings for the three different test procedures (Figs. 26-28) cover only the first option (eleven points of the recall-level averages plus the four overall measures).

For the T-test (Fig. 26), the following values are given for each of the fifteen statistics: the mean and standard deviation of the statistic for each of the two searches (A and B); the mean and standard deviation of the differences between the statistics for A and B; and a value T, which is defined as

    T = (\bar{A} - \bar{B})\,\sqrt{N} / \sigma_d

where N is the number of degrees of freedom (which is one less than the number of queries being tested). The one-sided and two-sided probabilities (indicating whether a difference between the two samples as great as, or greater than, that observed would occur by chance) are also listed. Finally, the fifteen one-sided tests are statistically combined into a single measure, also listed.

[Figs. 26 and 27: T-test and sign test listings]

The sign test (Fig. 27) gives the number of queries favoring search A, favoring search B, and tied; the normal deviate ignoring ties (computed by using the binomial normal approximation); and the one-sided and two-sided
probabilities for the test ignoring ties. The normal deviate and the one-sided probability using ties (based on a method developed by Cathy May [8]) are also computed and listed. The one-sided tests are again statistically combined into overall figures.

The Wilcoxon signed rank test (Fig. 28) gives the sum of ranks favoring search A and favoring search B; the number of degrees of freedom (specifically, the number of untied pairs); the normal deviate (computed using the Wilcoxon-normal approximation); and the resulting one-sided and two-sided probabilities. A statistically combined significance value is also listed.

3. Access to the SMART System

The SMART system exists at Cornell as a private library system, located on a disk, which is accessible by reading in sets of control cards. When the SMART programs are loaded, a routine called EXEC receives control. This routine interrogates control cards in the data stream to ascertain which routines are desired and transfers control to those routines in the sequence requested.
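The control-card dispatch performed by EXEC can be sketched as a simple interpreter loop. The routine names are taken from the text; the handler bodies and card formats are placeholders, not the actual implementation:

```python
# Sketch of an EXEC-style dispatch loop: read each control card, collect its
# parameter cards, and transfer control to the named routine in sequence.
# Handler bodies are placeholders; only the routine names come from the report.

def clustr(params): return f"CLUSTR with {len(params)} parameter cards"
def search(params): return f"SEARCH with {len(params)} parameter cards"
def averag(params): return f"AVERAG with {len(params)} parameter cards"

ROUTINES = {"CLUSTR": clustr, "SEARCH": search, "AVERAG": averag}

def exec_deck(cards):
    """Interpret a deck: routine cards dispatch; '(...)' cards are parameters."""
    log, current, params = [], None, []
    for card in cards + ["STOP"]:           # STOP ends the job
        if card.startswith("("):            # parameter card for current routine
            params.append(card)
        else:                               # new routine card: run the previous one
            if current:
                log.append(ROUTINES[current](params))
            if card == "STOP":
                break
            current, params = card, []
    return log

deck = ["CLUSTR", "(parameters)", "SEARCH", "(parameters)", "(parameters)"]
print(exec_deck(deck))
```

This mirrors the deck structure shown below: each routine card is followed by the parameter cards that configure it, and routines run in the order the cards appear.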
A typical deck setup for the system is reproduced as follows:

    //JOB             initiates SMART routines
    //SMART

    CLUSTR            sets up document groups for a collection already on file
    (parameters)
    (parameters)
    (parameters)

    SEARCH            performs retrieval runs using methods called for by the parameter cards
    (parameters)
    (parameters)

    AVERAG            performs statistical averages for the previous search
    (parameters)
    (parameters)

    STOP              signals end of job

[Fig. 28: Wilcoxon signed rank test listing]

4. Basic SMART System Flowchart

The SMART routines fall into two categories: routines that can be called with control cards, and routines that can only be called by other routines. The latter set is interconnected by means of complex internal vectors, designed to make the most efficient use of in-core storage. A flowchart is produced in Figs. 29-33. The routines which can be called by control cards are enclosed by boxes.
    CRDCOL    adds a vector collection to the system or modifies an
              existing vector collection
    RESCOL    inputs the search results of a collection for averaging
    LSTCOL    lists a vector collection
    LSTITM    lists an item (such as a document or query)

SMART System Chart - Vector Handling Routines
Fig. 31

    AVERAG    computes different kinds of averages for search runs
    LSTAVE    lists and plots the graphs for the averages calculated
              by AVERAG
    VERIFY    runs three kinds of significance tests of pairs of
              averaged results

SMART System Chart - Averaging Routines
Fig. 32

[Fig. 33: SMART System Chart flowchart; text garbled in scan.]
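The significance testing done by VERIFY (the introduction names the sign test, the t test, and the Wilcoxon rank sum test) can be illustrated with the simplest of the three. The sketch below is a minimal Python implementation of a two-sided paired sign test only; the other two tests are omitted, and the per-query score lists in the usage example are invented.

```python
# Minimal sketch of a two-sided paired sign test, the simplest of the
# three significance tests VERIFY applies to pairs of averaged results.
# (Illustrative only; not the original SMART code.)

from math import comb

def sign_test(a, b):
    """Two-sided sign test on paired per-query scores; ties dropped.

    Counts how often method a beats method b and computes the
    two-sided binomial tail probability under the null p = 1/2.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    k = sum(d > 0 for d in diffs)
    tail = min(k, n - k)
    p = sum(comb(n, j) for j in range(tail + 1)) * 2 / 2 ** n
    return min(p, 1.0)   # the doubled tail can exceed 1 when k == n/2

# Hypothetical per-query precision figures for two search methods:
method_a = [0.42, 0.38, 0.55, 0.61, 0.47, 0.52, 0.40, 0.58, 0.49, 0.33]
method_b = [0.30, 0.35, 0.50, 0.44, 0.47, 0.41, 0.36, 0.51, 0.39, 0.37]
```

With nine wins for method a out of nine untied queries (one query is tied and dropped), the sign test reports a small p-value, suggesting the difference between the two averaged runs is unlikely to be chance.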



References

[1] E. Ide, R. Williamson, and D. Williamson, The Cornell Programs for Cluster Searching and Relevance Feedback, Information Storage and Retrieval, Report ISR-12 to the National Science Foundation, Section IV, Department of Computer Science, Cornell University, June 1967.

[2] D. Williamson, The Cornell Implementation of the SMART System, Information Storage and Retrieval, Report ISR-14 to the National Science Foundation, Section II, Department of Computer Science, Cornell University, October 1968.

[3] G. Salton, Automatic Information Organization and Retrieval, McGraw-Hill Book Co., New York, 1968, Chapter 3.

[4] J. J. Rocchio, Jr., Document Retrieval Systems - Optimization and Evaluation, Harvard Doctoral Thesis, Information Storage and Retrieval, Report ISR-10 to the National Science Foundation, Harvard Computation Laboratory, Cambridge, March 1966.

[5] R. T. Dattola, A Fast Algorithm for Automatic Classification, Information Storage and Retrieval, Report ISR-14 to the National Science Foundation, Section V, Department of Computer Science, October 1968.

[6] R. Williamson, Centroid Searching (Tree Searching), unpublished paper, May 1968.

[7] G. Salton and M. E. Lesk, Computer Evaluation of Indexing and Text Processing, Information Storage and Retrieval, Report ISR-12 to the National Science Foundation, Section III, Department of Computer Science, Cornell University, June 1967.

[8] Cathy May, Evaluation of Search Methods in an Information Retrieval System, Term Report written for Computer Science 435, Cornell University, May 1968.