
IV. Automatic Processing of Foreign Language Documents

G. Salton

Abstract

Experiments conducted over the last few years with the SMART document retrieval system have shown that fully automatic text processing methods using relatively simple linguistic tools are as effective for purposes of document indexing, classification, search, and retrieval as the more elaborate manual methods normally used in practice. Up to now, all experiments were carried out entirely with English language queries and documents. The present study describes an extension of the SMART procedures to German language materials. A multi-lingual thesaurus is used for the analysis of documents and search requests, and tools are provided which make it possible to process English language documents against German queries, and vice versa. The methods are evaluated, and it is shown that the effectiveness of the mixed language processing is approximately equivalent to that of the standard process operating within a single language only.

1. Introduction

For some years, experiments have been under way to test the effectiveness of automatic language analysis and indexing methods in information retrieval. Specifically, document and query texts are processed fully automatically, and content identifiers are assigned using a variety of linguistic tools, including word stem analysis, thesaurus look-up, phrase recognition, statistical term association, syntactic analysis, and so on. The resulting concept identifiers assigned to each document and search request are then matched, and the documents whose identifiers are sufficiently close to the queries are retrieved for the user's attention.

The automatic analysis methods can be made to operate in real-time (while the customer waits for an answer) by restricting the query-document comparisons to only certain document classes, and interactive user-controlled search methods can be implemented which adjust the search request during the search in such a way that more useful, and less useless, material is retrieved from the file.

The experimental evidence accumulated over the last few years indicates that retrieval systems based on automatic text processing methods (including fully automatic content analysis as well as automatic document classification and retrieval) are not in general inferior in retrieval effectiveness to conventional systems based on human indexing and human query formulation.

One of the major objections to the practical utilization of the automatic text processing methods has been the inability to handle automatically foreign language texts of the kind normally stored in documentation and library systems. Recent experiments performed with document abstracts and search requests in French and German appear to indicate that these objections may be groundless.

In the present study, the SMART document retrieval system is used to carry out experiments using as input foreign language documents and queries. The foreign language texts are automatically processed using a thesaurus (synonym dictionary) translated directly from a previously available English version. Foreign language query and document texts are looked up in the foreign language thesaurus, and the analyzed forms of the queries and documents are then compared in the standard manner before retrieving the highly matching items.

The language analysis methods incorporated into the SMART system are first briefly reviewed. Thereafter, the main procedures used to process the foreign language documents are described, and the retrieval effectiveness of the English text processing methods is compared with that of the foreign language material.

2. The SMART System

SMART is a fully-automatic document retrieval system operating on the IBM 7094 and 360 model 65. Unlike other computer-based retrieval systems, the SMART system does not rely on manually assigned key words or index terms for the identification of documents and search requests, nor does it use primarily the frequency of occurrence of certain words or phrases included in the texts of documents. Instead, an attempt is made to go beyond simple word-matching procedures by using a variety of intellectual aids in the form of synonym dictionaries, hierarchical arrangements of subject identifiers, statistical and syntactic phrase generation methods and the like, in order to obtain the content identifications useful for the retrieval process.

Stored documents and search requests are then processed without any prior manual analysis by one of several hundred automatic content analysis methods, and those documents which most nearly match a given search request are extracted from the document file in answer to the request. The system may be controlled by the user, in that a search request can be processed first in a standard mode; the user can then analyze the output obtained and, depending on his further requirements, order a reprocessing of the request under new conditions. The new output can again be examined and the process iterated until the right kind and amount of information are retrieved. [1,2,3]

SMART is thus designed as an experimental automatic retrieval system of the kind that may become current in operational environments some years hence. The following facilities, incorporated into the SMART system for purposes of document analysis, may be of principal interest:

a) a system for separating English words into stems and affixes (the so-called suffix 's' and stem thesaurus methods) which can be used to construct document identifications consisting of the stems of words contained in the documents;

b) a synonym dictionary, or thesaurus, which can be used to recognize synonyms by replacing each word stem by one or more "concept" numbers; these concept numbers then serve as content identifiers instead of the original word stems;

c) a hierarchical arrangement of the concepts included in the thesaurus which makes it possible, given any concept number, to find its "parents" in the hierarchy, its "sons", its "brothers", and any of a set of possible cross-references; the hierarchy can be used to obtain more general content identifiers than the ones originally given by going up in the hierarchy, more specific ones by going down, and a set of related ones by picking up brothers and cross-references;

d) statistical procedures to compute similarity coefficients based on co-occurrences of concepts within the sentences of a given collection; the related concepts, determined by statistical association, can then be added to the originally available concepts to identify the various documents;

e) syntactic analysis methods which make it possible to compare the syntactically analyzed sentences of documents and search requests with a pre-coded dictionary of syntactic structures ("criterion trees") in such a way that the same concept number is assigned to a large number of semantically equivalent, but syntactically quite different constructions;

f) statistical phrase matching methods which operate like the preceding syntactic phrase procedures, that is, by using a preconstructed dictionary to identify phrases used as content identifiers; however, no syntactic analysis is performed in this case, and phrases are defined as equivalent if the concept numbers of all components match, regardless of the syntactic relationships between components;

g) a dictionary updating system, designed to revise the several dictionaries included in the system:
   i) word stem dictionary
   ii) word suffix dictionary
   iii) common word dictionary (for words to be deleted during analysis)
   iv) thesaurus (synonym dictionary)
   v) concept hierarchy
   vi) statistical phrase dictionary
   vii) syntactic ("criterion") phrase dictionary.
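Facilities (a) and (b) can be illustrated with a minimal sketch: words are reduced to stems, common words are deleted, and each stem found in the thesaurus is replaced by its concept number. The suffix list, the common-word list, and the function names are assumptions made for illustration only; the thesaurus entries are adapted from the excerpt in Fig. 3, and the actual SMART dictionaries and stemming rules are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical suffix list, longest first; not the SMART suffix dictionary.
SUFFIXES = ("ing", "ed", "s")

# Hypothetical common-word list (words deleted during analysis).
COMMON_WORDS = {"and", "for", "the", "of"}

# Stem -> concept number; entries adapted from the thesaurus excerpt in Fig. 3.
THESAURUS = {
    "enter": 237, "insert": 237, "post": 237,
    "relay": 235, "scanner": 235, "tube": 235,
}


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def concept_vector(text):
    """Return (concept number -> count, list of stems not found)."""
    concepts, not_found = Counter(), []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in COMMON_WORDS:
            continue
        s = stem(word)
        if s in THESAURUS:
            concepts[THESAURUS[s]] += 1
        else:
            not_found.append(s)
    return concepts, not_found


concepts, missing = concept_vector("Inserted relays and scanner tubes for computers")
print(dict(concepts), missing)
```

The stems that fail the look-up are returned separately because the number of words not found in the thesaurus is itself a quantity examined later in the study.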

The operations of the system are built around a supervisory system which decodes the input instructions and arranges the processing sequence in accordance with the instructions received. The SMART system's organization makes it possible to evaluate the effectiveness of the various processing methods by comparing the outputs produced by a variety of different runs. This is achieved by processing the same search requests against the same document collections several times, and making judicious changes in the analysis procedures between runs. In each case, the search effectiveness is evaluated by presenting paired comparisons of the average performance over many search requests for two given search and retrieval methodologies.


3. The Evaluation of Language Analysis Methods

Many different criteria may suggest themselves for measuring the performance of an information system. In the evaluation work carried out with the SMART system, the effectiveness of an information system is assumed to depend on its ability to satisfy the users' information needs by retrieving wanted material, while rejecting unwanted items. Two measures have been widely used for this purpose, known as recall and precision, and representing respectively the proportion of relevant material actually retrieved, and the proportion of retrieved material actually relevant. [3] (Ideally, all relevant items should be retrieved, while at the same time, all nonrelevant items should be rejected, as reflected by perfect recall and precision values equal to 1.)

It should be noted that both the recall and precision figures achievable by a given system are adjustable, in the sense that a relaxation of the search conditions often leads to high recall, while a tightening of the search criteria leads to high precision. Unhappily, experience has shown that on the average recall and precision tend to vary inversely, since the retrieval of more relevant items normally also leads to the retrieval of more irrelevant ones. In practice, a compromise is usually made, and a performance level is chosen such that much of the relevant material is retrieved, while the number of nonrelevant items which are also retrieved is kept within tolerable limits.

In theory, one might expect that the performance of a retrieval system would improve as the language analysis methods used for document and query processing become more sophisticated. In actual fact, this turns out not to be the case. A first indication of the fact that retrieval effectiveness does not vary directly with the complexity of the document or query analysis was provided by the output of the Aslib-Cranfield studies. This project tested a large variety of indexing languages in a retrieval environment, and came to the astonishing conclusion that the simplest type of indexing language would produce the best results. [4] Specifically, three types of indexing languages were tested, called respectively single terms (that is, individual terms, or concepts assigned to documents and queries), controlled terms (that is, single terms assigned under the control of the well-known EJC Thesaurus of Engineering and Scientific Terms), and finally simple concepts (that is, phrases consisting of two or more single terms). The results of the Cranfield tests indicated that single terms are more effective for retrieval purposes than either controlled terms or complete phrases. [4]

These results might be dismissed as being due to certain peculiar test conditions if it were not for the fact that the results obtained with the automatic SMART retrieval system substantially confirm the earlier Cranfield output. [3] Specifically, the following basic conclusions can be drawn from the main SMART experiments:

a) the simplest automatic language analysis procedure, consisting of the assignment to queries and documents of weighted word stems originally contained in these documents, produces a retrieval effectiveness almost equivalent to that obtained by intellectual indexing carried out manually under controlled conditions; [3,5]

b) use of a thesaurus look-up process, designed to recognize synonyms and other term relations by replacing the original word stems by the corresponding thesaurus categories, improves the retrieval effectiveness by about ten percent in both recall and precision;

c) additional, more sophisticated language analysis procedures, including the assignment of phrases instead of individual terms, the use of a concept hierarchy, the determination of syntactic relations between terms, and so on, do not, on the average, provide improvements over the standard thesaurus process.

An example of a typical recall-precision graph produced by the SMART system is shown in Fig. 1, where a statistical phrase method is compared with a syntactic phrase procedure. In the former case, phrases are assigned as content identifiers to documents and queries whenever the individual phrase components are all present within a given document; in the latter case, the individual components must also exhibit an appropriate syntactic relationship before the phrase is assigned as an identifier. The output of Fig. 1 shows that the use of syntax degrades performance (the ideal performance region is in the upper right-hand corner of the graph where both the recall and the precision are close to 1). Several arguments may explain the output of Fig. 1:

a) the inadequacy of the syntactic analyzer used to generate syntactic phrases;

b) the fact that phrases are often appropriate content identifiers even when the phrase components are not syntactically related in a given context (e.g. the sentence "people who need information, require adequate retrieval services" is adequately identified by the phrase "information retrieval", even though the components are not related in the sentence);

c) the variability of the user population which makes it unwise to overspecify document content;

d) the ambiguity inherent in natural language texts which may work to advantage when attempting to satisfy the information needs of a heterogeneous user population with diverse information needs.

Recall    Precision               Precision
          (statistical phrases)   (syntactic phrases)
0.1       .960                    .938
0.3       .834                    .776
0.5       .769                    .735
0.7       .706                    .625
0.9       .546                    .467

Comparison Between Statistical and Syntactic Phrases (averages over 17 queries)
Fig. 1

Most likely a combination of some of the above factors is responsible for the fact that relatively simple content analysis methods are generally preferable in a retrieval environment to more sophisticated methods. The foreign language processing to be described in the remainder of this study must be viewed in the light of the foregoing test results.
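The recall and precision measures used throughout this section can be written out as a short sketch. The document identifiers and the ranked list below are hypothetical, chosen only to show how relaxing the retrieval cutoff raises recall while precision tends to fall; the function name is likewise an assumption.

```python
def recall_precision(retrieved, relevant):
    """Recall: share of relevant items that were retrieved.
    Precision: share of retrieved items that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical ranked output (best match first) and relevance judgments.
ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d3", "d1", "d2"}

# A tight cutoff favors precision; a relaxed cutoff favors recall.
for cutoff in (1, 3, 6):
    r, p = recall_precision(ranked[:cutoff], relevant)
    print(f"cutoff={cutoff}  recall={r:.2f}  precision={p:.2f}")
```

Averaging such values over many queries at fixed recall levels would give curves of the kind tabulated in Fig. 1.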

4. Multi-lingual Thesaurus

The multi-lingual text processing experiment is motivated by the following principal considerations:

a) in typical American libraries up to fifty percent of the stored materials may not be in English; about fifty percent of the material processed in a test at the National Library of Medicine in Washington was not in English (of this, German accounted for about 25%, French for 23%, Italian for 13%, Russian for 11%, Japanese for 6%, Spanish for 5%, and Polish for 5%); [6]

b) in certain statistical text processing experiments carried out with foreign language documents, the test results were about equally good for German as for English; [7]

c) simple text processing methods appear to work well for English, and there is no a priori reason why they should not work equally well for another language.

The basic multi-lingual system used for test purposes is outlined in Fig. 2. Document (or query) texts are looked up in a thesaurus and reduced to "concept vector" form; query vectors and document vectors are then compared, and document vectors sufficiently similar to the query are withdrawn from the file. In order to ensure that mixed language input is properly processed, the thesaurus must assign the same concept categories, no matter what the input language. The SMART system therefore utilizes a multi-lingual thesaurus in which one concept category corresponds both to a family of English words, or word stems, and to their German translation.

[Fig. 2: diagram of the basic multi-lingual system; recoverable labels: Query Vector, Search Routines, Retrieval Output]

A typical thesaurus excerpt is shown in Fig. 3, giving respectively concept numbers, English word class, and corresponding German word class. This thesaurus was produced by manually translating into German an originally available English version. Tables 1 and 2 show the results of the thesaurus look-up operation for the English and German versions of query QB 13. The original query texts in three languages (English, French, and German) are shown in Fig. 4. It may be seen that seven out of 9 "English" concepts are common with the German concept vector for the same query. In view of this, one may expect that the German query processed against the German thesaurus could be matched against English language documents as easily as the English version of the query.

Tables 1 and 2 also show that more query words were not found during look-up in the German thesaurus than in the English one. This is due to the fact that only a preliminary incomplete version of the German thesaurus was available at run time.
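The overlap between the two concept vectors for query QB 13 can be checked with a small sketch. The concept numbers and weights are read from Tables 1 and 2, pairing weights with concepts in row order. The text says only that the vectors are "compared"; the cosine coefficient used below is an assumption chosen for illustration, not necessarily the matching function applied in these runs.

```python
import math

# Concept number -> weight, as read from Table 1 (English) and Table 2 (German).
english_qb13 = {3: 12, 19: 12, 33: 12, 49: 12, 65: 12,
                147: 12, 207: 12, 267: 12, 345: 12}
german_qb13 = {3: 12, 19: 12, 21: 4, 33: 6, 45: 4, 64: 4,
               65: 12, 68: 12, 147: 6, 207: 12, 267: 12}


def common_concepts(a, b):
    """Concept numbers present in both vectors, in ascending order."""
    return sorted(set(a) & set(b))


def cosine(a, b):
    """Cosine coefficient between two sparse concept vectors (assumed measure)."""
    dot = sum(w * b[c] for c, w in a.items() if c in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


shared = common_concepts(english_qb13, german_qb13)
print(len(shared), "of", len(english_qb13), "English concepts are shared:", shared)
print("cosine similarity:", round(cosine(english_qb13, german_qb13), 3))
```

Because both language versions map to the same concept numbers, the same comparison applies unchanged to a German query and an English document vector.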

5. Foreign Language Retrieval Experiment

To test the simple multi-lingual thesaurus process, two collections of documents in the area of library science and documentation (the Ispra collection) were processed against a set of 48 search requests in the documentation area. The English collection consisted of 1095 document abstracts, whereas the German collection contained only 468 document abstracts. The overlap between the two collections included 50 common documents. All 48 queries were originally available in English; they were manually translated into German by a native German speaker.

Concept   English word class                    German word class
230       ART                                   ARCHITEKTUR
231       INDEPEND                              SELBSTAENDIG UNABHAENGIG
232       ASSOCIATIVE
233       DIVIDE
234       ACTIVE ACTIVITY USAGE                 AKTIV AKTIVITAET TAETIGKEIT
235       CATHODE CRT DIODE FLYING-SPOT RAY     DIODE VERZWEIGER
          RELAIS RELAY SCANNER TUBE
236       REDUNDANCY REDUNDANT
237       CHARGE ENTER ENTRY INSERT POST        EINGANG EINGEGANGEN EINGEGEBEN
                                                EINSATZ EINSTELLEN EINTRAGUNG
238       MULTI-LEVEL MULTILEVEL
239       INTELLECT INTELLECTUAL INTELLIG       GEISTIG
          MENTAL MIND NON-INTELLECTUAL
240       ACTUAL PRACTICE REAL                  PRAXIS

Excerpt from Multi-Lingual Thesaurus
Fig. 3

English Query QB 13

Concepts   Weights   Thesaurus Category
3 /        12        computer, processor
19 /       12        automatic, semiautomatic
33 /       12        analyze, analyzer, analysis, etc.
49         12        compendium, compile, deposit
65 /       12        authorship, originator
147 /      12        discourse, language, linguistic
207 /      12        area, branch, subfield
267 /      12        concordance, keyword-in-context, KWIC
345        12        bell
*                    anonymous, lettres

/ common concept with German query
* words not found in thesaurus

Thesaurus Look-up for English Query QB 13
Table 1

German Query QB 13

Concepts   Weights   Thesaurus Category
3 /        12        Computer, Datenverarbeitung
19 /       12        Automatisch, Kybernetik
21         4         Artikel, Presse, Zeitschrift
33 /       6         Analyse, Sprachenanalyse
45         4         Herausgabe, Publikation
64         4         Buch, Heft, Werk
65 /       12        Autor, Verfasser
68         12        Literatur
147 /      6         Linguistik, Sprache
207 /      12        Arbeitsgebiet, Fach
267 /      12        Konkordanz, KWIC
*                    schoenen, hilfreich, vermutlich, anonymen, zusammenzustellen

/ common concept with English query
* words not found in thesaurus

Thesaurus Look-up for German Query QB 13
Table 2

*FIND Q13B AUTHORS
IN WHAT WAYS ARE COMPUTER SYSTEMS BEING APPLIED TO RESEARCH IN THE FIELD OF THE BELLES LETTRES ? HAS MACHINE ANALYSIS OF LANGUAGE PROVED USEFUL FOR INSTANCE, IN DETERMINING PROBABLE AUTHORSHIP OF ANONYMOUS WORKS OR IN COMPILING CONCORDANCES ?

DANS QUEL SENS LES CALCULATEURS SONT-ILS APPLIQUES A LA RECHERCHE DANS LE DOMAINE DES BELLES-LETTRES ? EST-CE QUE L'ANALYSE AUTOMATIQUE DES TEXTES A ETE UTILE, PAR EXEMPLE, POUR DETERMINER L'AUTEUR PROBABLE D'OUVRAGES ANONYMES OU POUR FAIRE DES CONCORDANCES ?

INWIEWEIT WERDEN COMPUTER-SYSTEME ZUR FORSCHUNG AUF DEM GEBIET DER SCHOENEN LITERATUR VERWENDET ? HAT SICH MASCHINELLE SPRACHENANALYSE ALS HILFREICH ERWIESEN, UM Z.B. DIE VERMUTLICHE AUTORENSCHAFT BEI ANONYMEN WERKEN ZU BESTIMMEN ODER UM KONKORDANZEN ZUSAMMENZUSTELLEN ?

Query QB 13 in Three Languages
Fig. 4

The English queries were then processed against both the English and the German collections (runs E-E and E-G), and the same was done for the translated German queries (runs G-E and G-G, respectively). Relevance assessments were made for each English document abstract with respect to each English query by a set of eight American students in library science, and the assessors were not identical to the users who originally submitted the search requests. The German relevance assessments (German documents against German queries), on the other hand, were obtained from a different, German speaking, assessor.

The principal evaluation results for the four runs using the thesaurus process are shown in Fig. 5, averaged over 48 queries in each case. It is clear from the output of Fig. 5 that the cross-language runs, E-G (English queries - German documents) and G-E (German queries - English documents), are not substantially inferior to the corresponding output within a single language (G-G and E-E, respectively), the difference being of the order of 0.02 to 0.03 for a given recall level. On the other hand, both runs using the German document collection are inferior to the runs with the English collection. The output of Fig. 5 leads to the following principal conclusions:

a) the query processing is comparable in both languages; for if this were not the case, then one would expect one set of query runs to be much less effective than the other (that is, either E-E and E-G, or else G-G and G-E);

b) the language processing methods (that is, thesaurus categories, suffix cut-off procedures, etc.) are equally effective in both cases; if this were not the case, one would expect one of the single language runs to come out very poorly, but neither E-E nor G-G came out as the poorest run;

c) the cross-language runs are performed properly, for if this were not the case, one would expect E-G and G-E to perform much less well than the runs within a single language; since this is not the case, the principal conclusion is that documents in one language can be matched against queries in another nearly as well as documents and queries in a single language;

d) the runs using the German document collection (E-G and G-G) are less effective than those performed with the English collection; the indication is then apparent that some characteristic connected with the German document collection itself (for example, the type of abstract, or the language of the abstract, or the relevance assessments) requires improvement; the effectiveness of the cross-language processing, however, is not at issue.

[Fig. 5: recall-precision results for the four thesaurus runs (E-E, E-G, G-E, G-G), averaged over 48 queries; plotted values not recoverable]

The foreign language analysis is summarized in Table 3.
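The four-way reasoning summarized in Table 3 can be sketched as a set of comparisons between run scores. The scores and the margin below are hypothetical values invented for illustration; they are not the Fig. 5 results. The function name and the way "much poorer" is turned into a fixed margin are likewise assumptions.

```python
def diagnose(scores, margin=0.05):
    """Apply the four checks of Table 3 to average scores of the runs.

    scores maps run names ('E-E', 'E-G', 'G-E', 'G-G'; query language first,
    document language second) to an average effectiveness value.
    """
    ee, eg, ge, gg = (scores[k] for k in ("E-E", "E-G", "G-E", "G-G"))
    mean = lambda a, b: (a + b) / 2
    return {
        # English-query runs versus German-query runs differ markedly.
        "poor query processing or translation":
            abs(mean(ee, eg) - mean(ge, gg)) > margin,
        # One single-language run is much poorer than the cross-language runs.
        "poor language processing":
            min(ee, gg) < min(eg, ge) - margin,
        # Both cross-language runs are poorer than the single-language runs.
        "poor cross-language processing":
            max(eg, ge) < min(ee, gg) - margin,
        # Runs on one document collection are poorer than runs on the other.
        "poor processing of one document collection":
            abs(mean(ee, ge) - mean(eg, gg)) > margin,
    }


# Hypothetical scores, for illustration only.
example = {"E-E": 0.50, "G-E": 0.48, "E-G": 0.38, "G-G": 0.40}
for problem, confirmed in diagnose(example).items():
    print(f"{problem}: {'Yes' if confirmed else 'No'}")
```

With these illustrative numbers only the last check is confirmed, which mirrors the pattern of conclusions (a) through (d).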

6. Failure Analysis

Since the query processing operates equally well in both languages, while the German document collection produces a degraded performance, it becomes worthwhile to examine the principal differences between the two document collections. These are summarized in Table 4. The following principal distinctions arise:

a) the organization of the thesaurus used to group words or word stems into thesaurus categories;

b) the completeness of the thesaurus in terms of words included in it;

c) the type of document abstracts included in the collection;

d) the accuracy of the relevance assessments obtained from the collections.

Translation Problem                   Corresponding Observation                  Observation Confirmed
Poor query processing or              E-E and E-G much better than G-E           No
poor translation                      and G-G, or vice versa
Poor language processing              Either E-E or G-G much poorer than         No
                                      cross-language runs
Poor cross-language processing        Both E-G and G-E poorer than other runs    No
Poor processing of one document       Either E-G and G-G, or else G-E and        Yes
collection                            E-E simultaneously poor

E-E: English queries - English documents
E-G: English queries - German documents
G-E: German queries - English documents
G-G: German queries - German documents

Analysis of Foreign Language Processing
Table 3

Characteristic of Collections                              English   German
Number of document abstracts                               1095      468
Number of documents common to both collections             50        50
Number of queries used in test                             48        48
Number of relevance assessors                              8         1
Number of common relevance assessors                       0         0
Generality of collection (number of relevant documents     0.013     0.029
over total number of documents in collection)
Average number of word occurrences not found in the        6.5       15.5
thesaurus during look-up of document abstracts

Characteristics of Document Collections
Table 4

Concerning first the organization of the multi-lingual thesaurus, it does not appear that any essential difficulties arise on that account. This is confirmed by the fact that the cross-language runs operate satisfactorily, and by the output of Fig. 6(a) comparing a German word stem run (using standard suffix cut-off and weighting procedures) with a German thesaurus run. It is seen that the German thesaurus improves performance over word stems for the German collection in the same way as the English thesaurus was seen earlier to improve retrieval effectiveness over the English word stem analysis. [2,3]

The other thesaurus characteristic, that is, its completeness, appears to present a more serious problem. Table 4 shows that only approximately 6.5 English words per document abstract were not included in the English thesaurus, whereas over 15 words per abstract were missing from the German thesaurus. Obviously, if the missing words turn out to be important for content analysis purposes, the German abstracts will be more difficult to analyze than their English counterparts. A brief analysis confirms that many of the missing German words, which do not therefore produce concept numbers assignable to the documents, are indeed important for content identification. Fig. 7, listing the words not found for document 005, shows that 12 out of 14 missing words appear to be important for the analysis of that document. It would therefore seem essential that a more complete thesaurus be used under operational conditions and for future experiments.*

*A rerun of the experiment using a more complete thesaurus is described in the appendix.
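The completeness figure quoted from Table 4 (the average number of word occurrences per abstract not found in the thesaurus) can be computed along the following lines. The word lists and the thesaurus vocabulary below are hypothetical and far smaller than the actual collections; the function name is an assumption.

```python
def average_not_found(abstracts, thesaurus_words):
    """Average number of word occurrences per abstract missing from the thesaurus.

    abstracts is a list of abstracts, each given as a list of (stemmed) words;
    repeated words are counted once per occurrence.
    """
    if not abstracts:
        return 0.0
    missing = [sum(1 for w in words if w not in thesaurus_words)
               for words in abstracts]
    return sum(missing) / len(abstracts)


# Hypothetical German thesaurus vocabulary and two toy abstracts.
vocabulary = {"computer", "sprache", "analyse", "autor", "werk"}
toy_abstracts = [
    ["computer", "sprache", "analyse", "konkordanz"],
    ["autor", "werk", "hilfreich"],
]
print(average_not_found(toy_abstracts, vocabulary))
```

A count of this kind says nothing about whether the missing words matter for content identification; that judgment still requires inspecting the words themselves, as is done for document 005.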

[Fig. 6: recall-precision graphs; (a) German word stem run compared with German thesaurus run; (b) German queries processed against the English and the German collections; plotted values not recoverable]

[Fig. 7: words not found in the thesaurus for document 005; word list not recoverable]

The other two collection characteristics, including the type of abstracts and the accuracy of the relevance judgments, are more difficult to assess, since these are not subject to statistical analysis. It is a fact that for some of the German documents informative abstracts are not available. For example, the abstract for document 028, included in Fig. 8, indicates that the corresponding document is a conference proceedings; very little is known about the subject matter of the conference, but the document was nevertheless judged relevant to six different queries (nos. 17, 27, 31, 32, 52, and 53) dealing with subjects as diverse as "behavioral studies of information system users" (query 17), and "the study of machine translation" (query 27). One might quarrel with such relevance assessments and with the inclusion of such documents in a test collection, particularly since Fig. 6(b) shows that the German queries operate more effectively with the English collection (using English relevance assessments) than with the German assessments. However, earlier studies using a variety of relevance assessments with the same document collection have shown that recall-precision results are not affected by ordinary differences in relevance assessments. [8] For this reason, it would be premature to assume that the performance differences are primarily due to distinctions in the relevance assessments or in the collection make-up.

7. Conclusion

An experiment using a multi-lingual thesaurus in conjunction with two different document collections, in German and English respectively, has shown that cross-language processing (for example, German queries against English documents) is nearly as effective as processing within a single language. Furthermore, a simple translation of thesaurus categories appears


to produce a document content analysis which is equally effective in English as in German; in particular, differences in morphology (for example, in the suffix cut-off rules) and in language ambiguities do not seem to cause a substantial degradation when moving from one language to another. For these reasons, the automatic retrieval methods used in the SMART system for English appear to be applicable also to foreign language material. Future experiments with foreign language documents should be carried out using a thesaurus that is reasonably complete in all languages, and with identical query and document collections for which the same relevance judgments may then be applicable across all runs.
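The cross-language matching summarized above can be illustrated with a minimal sketch. It assumes that words from either language are mapped to shared concept numbers and that queries and documents are compared by a cosine correlation of their concept vectors. The thesaurus entries, the concept numbers, and the function names are invented for illustration and are not taken from the SMART thesaurus; the input texts are assumed to be already reduced to word stems.

```python
from collections import Counter
from math import sqrt

# Hypothetical multi-lingual thesaurus: each stem, in either language,
# maps to a language-independent concept number (all entries invented).
THESAURUS = {
    "retrieval": 101, "wiedergewinnung": 101,
    "document": 102, "dokument": 102,
    "translation": 103, "uebersetzung": 103,
    "machine": 104, "maschinell": 104,
}

def concept_vector(text):
    """Count the concept numbers of all words found in the thesaurus."""
    words = text.lower().split()
    return Counter(THESAURUS[w] for w in words if w in THESAURUS)

def cosine(a, b):
    """Cosine correlation between two sparse concept vectors."""
    dot = sum(a[c] * b[c] for c in a if c in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A German query matched against two English documents.
german_query = concept_vector("maschinell uebersetzung")
english_doc_1 = concept_vector("machine translation document")
english_doc_2 = concept_vector("document retrieval")

print(cosine(german_query, english_doc_1))  # shares two concepts with the query
print(cosine(german_query, english_doc_2))  # shares no concept with the query
```

Because the comparison is made on concept numbers rather than on words, the language of the query and of the document does not enter the matching step in this sketch; only the coverage of the thesaurus in each language does.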

References

[1] G. Salton and M. E. Lesk, The SMART Automatic Document Retrieval System - An Illustration, Communications of the ACM, Vol. 8, No. 6, June 1965.

[2] G. Salton, Automatic Information Organization and Retrieval, McGraw Hill Book Company, New York, 1968, 514 pages.

[3] G. Salton and M. E. Lesk, Computer Evaluation of Indexing and Text Processing, Journal of the ACM, Vol. 15, No. 1, January 1968.

[4] C. W. Cleverdon and E. M. Keen, Factors Determining the Performance of Indexing Systems, Vol. 1: Design, Vol. 2: Test Results, Aslib-Cranfield Research Project, Cranfield, England, 1966.

[5] G. Salton, A Comparison Between Manual and Automatic Indexing Methods, American Documentation, Vol. 20, No. 1, January 1969.

[6] F. W. Lancaster, Evaluation of the Operating Efficiency of Medlars, Final Report, National Library of Medicine, Washington, January 1969.

[7] J. H. Williams, Computer Classification of Documents, FID-IFIP Conference on Mechanized Documentation, Rome, June 1967.

[8] M. E. Lesk and G. Salton, Relevance Assessments and Retrieval System Evaluation, Information Storage and Retrieval, Vol. 4, No. 4, October 1968.


Appendix

To test the effect of the missing words in the German thesaurus, the experiments were repeated using a more complete thesaurus to which previously missing entries had been added. The following table summarizes the differences in results, averaged over 48 queries as before (see Fig. 5).

              German Queries,            English Queries,
              German Documents           German Documents
              Precision                  Precision
   Recall     Old         New            Old         New
              Thesaurus   Thesaurus      Thesaurus   Thesaurus

     .1       .513        .527           .490        .513
     .3       .286        .327           .252        .299
     .5       .181        .203           .170        .185
     .7       .130        .140           .117        .122
     .9       .066        .096           .065        .091

              Average Precision at Fixed Recall Points
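As a rough check on the figures in the table, the following sketch computes the absolute gain in precision at each recall point and the mean gain over the five recall points. The precision values are copied from the table; the variable and function names are introduced here for illustration only.

```python
RECALL = [0.1, 0.3, 0.5, 0.7, 0.9]

# Precision values copied from the appendix table.
RUNS = {
    "German queries, German documents": {
        "old": [0.513, 0.286, 0.181, 0.130, 0.066],
        "new": [0.527, 0.327, 0.203, 0.140, 0.096],
    },
    "English queries, German documents": {
        "old": [0.490, 0.252, 0.170, 0.117, 0.065],
        "new": [0.513, 0.299, 0.185, 0.122, 0.091],
    },
}

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def gains(old, new):
    """Absolute precision gain (new minus old) at each recall point."""
    return [round(n - o, 3) for o, n in zip(old, new)]

for name, run in RUNS.items():
    print(name)
    for r, g in zip(RECALL, gains(run["old"], run["new"])):
        print(f"  recall {r:.1f}: gain {g:+.3f}")
    print(f"  mean gain: {mean(run['new']) - mean(run['old']):+.4f}")
```

Under this reading, the mean absolute gain is about 0.023 for both runs, with the largest gain at a recall of 0.3.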

It may be noted that an improvement in average precision of 2 to 5 percent results from the dictionary change. Even after the dictionary replacement, the English collection produces better results than the German, the differences in precision being about 10 percent at most recall points. These differences are due to one or more of the following deficiencies:

a) unavailability of informative abstracts in German;

b) misspellings in the German text;

c) a program limitation which restricts all German words to 24 characters (no English words exceed this limit);

d) discrepancies in the relevance judgments pertaining to the German collection;

e) suffixing problems in German, particularly those dealing with single letter suffixes such as 's', 'n', and 't'.
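Items c) and e) can be made concrete with a small sketch. The 24-character limit is taken from item c); the suffix rule below is a deliberately naive assumption written for illustration and is not the suffix cut-off procedure actually used in SMART. The function names and the example words are likewise chosen here only for illustration.

```python
MAX_WORD_LENGTH = 24                       # program limit described in item c)
SINGLE_LETTER_SUFFIXES = ("s", "n", "t")   # suffixes of the kind named in item e)

def truncate(word):
    """Keep only the first 24 characters of a word."""
    return word[:MAX_WORD_LENGTH]

def strip_single_letter_suffix(word, min_stem=3):
    """Naive rule (an assumption): drop one trailing 's', 'n' or 't'."""
    if len(word) > min_stem and word.endswith(SINGLE_LETTER_SUFFIXES):
        return word[:-1]
    return word

# Two distinct long compounds become identical after truncation.
a = "informationswiedergewinnungssystem"
b = "informationswiedergewinnungsverfahren"
print(truncate(a), truncate(b), truncate(a) == truncate(b))

# A final 't' that belongs to the stem is removed, so the base form
# and the inflected form no longer reduce to the same stem.
print(strip_single_letter_suffix("dokument"))
print(strip_single_letter_suffix("dokuments"))
```

In this sketch the truncation merges two different compounds into one entry, while the single-letter rule separates two forms of the same word; either effect could plausibly lower the match between German texts and the thesaurus.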

These problems may eventually be solved in further work with the German texts. The basic results pertaining to the cross-language processing are in any case unaffected.