CHALLENGE PAPER: HOMEOSEMY — ON THE LINGUISTICS OF INFORMATION RETRIEVAL*
Hans Karlgren
KVAL Institute for Information Science

A Need and a Tool?

[1] Linguistics is necessary for the design of future computer-based information retrieval systems.

This is a strong claim. Not surprisingly, several documentalists take offense when linguists propose it. They find that the present design of retrieval systems is not fundamentally deficient and needs polishing rather than remaking, and/or they expect major contributions to come from people within the field rather than from more or less ignorant outsiders. Insistence on a linguistic approach is often mistaken for an argument for use of natural language. But our claim is more fundamental. We shall make [1] stronger by adding

[2] Linguistics in information science is not restricted to possible natural language processing.

In the following we shall take for granted that the reader shares our view that

[3] mechanical retrieval systems as known today are very remote from what they could become in a foreseeable future,

and that [4] the major restriction today is not in the amount of retrievable data, the availability of the service, or the familiarity therewith among users, but in the selectivity of the search methods.

On the contrary, the development so far has increased the amounts of data accessible as compared to the pre-computer period while something rather blunt has replaced an extraction procedure which was often, thanks to qualified and dedicated librarians, highly selective and adaptable.

*This paper was formulated as a challenge paper to guide the discussions. Admittedly disputable statements are preceded by bracketed numbers.


Consequently, [5] research and experimentation should be redirected from mass processing to procedure testing, from building up still more data bases to be processed by conventional methods -- the strengths and weaknesses of which are well-known, to the study of small amounts of materials with more complex and less known retrieval procedures.

Accordingly, [6] building new data bases, installing new enquiry terminals, or training users in the manipulation of existing systems should not be tolerated, unless such measures can be justified on the grounds of their immediate profits without reference to possible research merits.

This view is in agreement with the principle of transferring research resources from areas of highly predictable results to such as we know less of. If a few large-scale but predictable data-base projects per year could be eliminated, the research on crucial retrieval problems could presumably be easily financed. This is an argument for long-range planning of this kind of research. A significant improvement of performance can certainly be achieved with any technology by making users more accustomed to the characteristics of that technology so that they play the game well. Thus, if users are taught to modify their habits of asking and of writing headings or summaries, etc., existing systems will produce better results. However, knowing how little we know about search methods and realizing how short our experience of mechanical information processing is altogether, [7] we shall normally reject every proposal for improving the efficiency of systems essentially by changing the habits of the users on this account.

It is a good question what change of habits is essential. The reasons for [7] are, among others, that [8] the improvement trend produced by such means cannot be extrapolated beyond a ceiling because there exist inherent limitations,

and [9] such a palliative treatment may be as disastrous as symptom-suppressing drugs administered to a person who is seriously ill; the temporary success may mask the need for long-range research.

Instead, the users' reluctance to comply with the formats prescribed by a system should be treated as a healthy reaction and be studied as such and not taught away.

It might be tempting to specify as a target level of ambition that [10]

a mechanical search system should perform as well on large data sets as a qualified and well-informed human does on small sets,

and that [11] the gain of mechanization should be wider coverage and faster inclusion of new material (learning), particularly important in new interdisciplinary fields which are typically those where human oracles fail.

But this level of ambition, while unattainable in some respects, is probably too low in others. Our own experience from retrieval of passages from within a single text -- say as little as 10,000 words of legal text -- reveals that, even so, humans fail to extract all relevant items. It is almost impossible to find all relevant implicit cross references to a given passage. He who has at his finger-tips all passages from the Bible, or a set of legal codes, or a normally ill-structured computer reference manual may pass as an uncommonly learned man: [12] document (passage) retrieval may be non-trivial even for humans when the data set is very small.

Now, granted that there is a need for some fundamental new insights in information retrieval, it is still not obvious that the linguists are those who could supply the missing tool. If documentalists accept good human performance as a challenge level there may be some aprioristic reasons for expecting linguists to have something to say; they might be expected to know a little about how humans do it. But we need more evidence than that. The linguists, clearly, carry the burden of the proof for statement [1].

Let us, already at this stage, eliminate a possible compromise stating that the kind of "linguistics" necessary for these tasks is a general science about Language, including such study of formal languages as mathematical information theory or formal logic. We do not mean merely that information science and linguistics unite in the most abstract spheres. Our issue is not whether this over-all study should be labelled linguistics or semiotics or informatics. By linguistics we mean the kind of knowledge peculiar to linguists, i.e., knowledge about certain properties of natural languages.

The Retrieval Problem

One can view document and other retrieval systems as a kind of question-answering devices*. The request can be understood as "Do you have something like XXX?" and the elicited offer may be understood as "Yes, I have YYY", or "In a way. I do have YYY." We shall loosely say that the answer is identical to the question if XXX = YYY.
*It is true that the "request" for literature of a given kind may also be understood as a command, but that interpretation does not preclude the question status of a search question, since all questions can be described as having the deeper structure of a command to supply information.


In an effort to distinguish between document retrieval, information retrieval in general, and other kinds of question-answering systems, we have found it useful to consider the following three kinds of question-answering systems, assuming the system at any one point in time to be deterministic in the sense that it produces only and all the same answers to any one given question. We have:

order i: systems with a finite set of questions
order ii: systems with a finite set of answers
order iii: systems with an infinite set of answers.

Order i. This group contains systems like those for airplane booking or spare parts inventories. The possible questions are many, the usefulness of the system enormous, and the problems of design and implementation may be formidable in several aspects, but one thing about them is trivial: the relation between question and answers. The questions are all foreseen. The answers could in principle be assigned a priori to the questions. All refinements over mere listing could be summarized as storage technique.*

Order ii. This is the typical document retrieval system. Whatever the question, the answer is a list of document references. The set of possible answers is the set of all subsets of the set of document references. The possible answers are many but in principle known prior to the questions.

Order iii. In the general question-answering system, the answers are derived on the basis of analysis of given input statements. The system can do more than reproduce statements that have been given to it. These are the systems which are expected to say whether a given substance will resist a certain load, whether a transaction is compatible with a given contract and a given set of legal rules, etc., etc. At this point, we are deeply into artificial intelligence and information processing in general.

Now, the distinction between finite and infinite may be impressive in definitions, but systems designers derive little comfort from the finiteness of very large numbers. Very large sets are often better treated as though they were infinite. To avoid in this field a repetition of the long fruitless discussions in linguistics about the finiteness of the set of sentences, our tentative definitions above should possibly be amended so that the crucial fact is whether or not the designer can make use of the finiteness.
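The difference between orders i and ii, and the remark on the finiteness of very large numbers, can be made concrete with a minimal sketch. Everything in it is an illustrative assumption: the table entries, the document names, and the function names are invented and do not describe any existing system.

```python
from itertools import combinations

# Order i: every admissible question is foreseen, so the answers can be
# tabulated in advance. The questions and replies below are invented.
ORDER_I_TABLE = {
    "seats left on flight 101?": "12",
    "seats left on flight 202?": "0",
}

def answer_order_i(question):
    """Pure lookup: a question outside the table is simply not admissible."""
    return ORDER_I_TABLE[question]

# Order ii: questions are open-ended, but every answer is a subset of a
# fixed file of document references.
DOCUMENTS = ("doc1", "doc2", "doc3")

def possible_answers(documents):
    """Enumerate every subset of the document file, i.e. every possible answer."""
    return [
        set(subset)
        for size in range(len(documents) + 1)
        for subset in combinations(documents, size)
    ]

def answer_space_size(n_documents):
    """Number of distinct answers an order ii system can give: 2 ** n."""
    return 2 ** n_documents
```

With three documents there are eight possible answers, which can still be listed. With 300 documents `answer_space_size` returns a 91-digit number: finite, but of no practical use to the designer, which is the sense in which very large sets are better treated as infinite.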
Another consideration which blurs the nice distinctions above is the need for interactive operation as soon as the relation between question and answer becomes at all complex. Thus, a document retrieval system aiming at selecting a subset of a given finite document file may need to enter into a dialogue with the questioner. In addition to straightforward answers such as "No, we do not have that" or "Yes, we have the following suggestions..." there are other adequate responses. In fact, we will very soon find almost the same wide range of responses to a question as in other studies of the semantics and pragmatics of questions. The system may, in one disguise or another, produce replies such as "The question is unintelligible", "Yes, but that would be at least 153 000 items; do you really want them listed?", "What do you mean by XXX?" or "Do you understand 'XXX' as 'YYY without ZZZZ'?", "Your question cannot be answered as it stands but do you object if we replace it by QQQ?", etc., etc. This meta-dialog may contain a potentially infinite number of system responses, even if the number of ultimate answers is finite. Still hoping that our distinction is of some value we shall concentrate on order ii and we shall restrict the term information retrieval to this case of information found and lost.

*Text compacting is a fascinating field in itself and does have a bearing on linguistics. But here we are rather in the domain common to linguists and others who study code design.

In document retrieval, the crucial problem -- note the singular form -- is to match the question/request with the description stored about the document. This problem does not essentially change its character if the description is of one kind or another; the full text of the document is just a special case of a description. If the items to be retrieved are not documents but, say, persons, patents, precedent cases, chemical substances, or processing methods, the retrieval must nevertheless operate on descriptions defining what can be offered.

[13] Thus, in any retrieval system for every retrievable item there must be an internally assigned description, an offer. The general retrieval problem is to match requests with offers.

[14] The definition of a good match is far from trivial.

[15] We need less prejudiced criteria for a good match.

[16] In particular, the definition of matches and the evaluation of goodness of fit must be independent of proposed search algorithms.

The actual design of retrieval systems requires much more than a good solution to problem [13], but that is the problem which distinguishes retrieval design from other difficult systems engineering tasks. Much has been written about the form and manipulability of descriptions. Perhaps too little attention has been given to the contents of the description. The adequacy of the description and the agreement between request and offer are different but related matters. The former concerns the relation between the object to be retrieved and its description, the latter the relation of that description to a request. These two relations have been particularly confused when the object to be retrieved is itself a text*, as is the case in document retrieval, which therefore is a very special case. And it is clearly one for linguists to handle whether or not they are in retrieval systems design. It is necessary to have philological knowledge about the relation heading/body of text, etc. And there are essential linguistic problems in the definition of the topic of a text; this has to do with theme/rheme relations, concepts which are defined on the sentence level but not yet on 'text' level.

*Obviously, a text can be described like any object by extraneous properties; one would be glad to have in a scientific system such features as "original," "elementary," "echo-paper", etc. The case which is often tacitly assumed to be the only relevant one is when only the contents of the text itself are described.


Therefore, [17] we recommend a study of the relation between document text and document description to be made independently of algorithms for deriving a description from a text. Just as it may be fruitful to investigate what constitutes a good match without considering how the pair is found, it may be worth-while to investigate what is a good description for a given purpose without considering how it is obtained.

The book, Linguistics and Information Science, by Sparck Jones and Kay (1973) seems to lead up to the conclusion that systems of order ii could better be handled as special cases of systems of order iii. The restriction to prefabricated answers does not make the task essentially easier. Conversely, one could maintain that even much more powerful question-answering systems could and should be designed as systems with retrieval components. Even a specially generated answer, individually phrased for the client, may be a simple function of partial answers retrieved from a set of input items (elementary statements or "facts"). Concluding, [18] an unprejudiced study of questions and answers is crucial,

and, in particular, [19] the role of presuppositions must be clarified.

On the one hand, it is clear that an information retrieval attempt may have as one of its major and intended results a reformulation of the question, based on the wider knowledge contained in the system (and not only in the documents themselves, if it is a document retrieval system). There must be a means for the answering system to reject presuppositions in a question without rejecting the question altogether. On the other hand, a questioner must be able to use a system even though he does not accept all the presuppositions built into the descriptions (which may have been phrased a long time in advance). It must be possible to accommodate moderate shifts in perspective, if we want the system to age slowly. This problem area deserves study in the light of the modern study of presuppositions.

A Non-Linguistic Approach

[20] Irrespective of their various algorithms, it seems that most techniques practised today are based on the following conception.

Request and offer are both phrased or rephrased in a retrieval language in which the matching (the "manipulation") is performed. (The "request language" and the "indexing language" are here seen as subsets of one operation language, just as questions and answers are subsets of one language; we disregard differences of format).


Since the original request is not always given in the retrieval language it must be translated into that language prior to searching and matching. Similarly, the original (document or other item) specification may have to be transformed into an offer expressed in the retrieval language by a translation procedure often called indexing. We then have the schema

R -> R': the given request is translated into an effective request susceptible to the manipulation needed for matching
O -> O': the given offer is translated into an effective offer susceptible to the manipulation needed for matching

Subsequent manipulations operate on the R' and the O' and lead to assignments of O's to R's (or vice versa, if you prefer). The matching of R's and O's is based on one or more of the following:

i. Identity
ii. Partial identity
iii. Some logical calculus (typically: Boolean algebra).

Even when the computations are fairly complex, consisting, say, of establishing chains of implications, they break down the comparison, in a few steps, to an assessment of identity or non-identity of primitives. These matching procedures in themselves are not regarded as linguistic procedures. Systems of this design often do include substantial linguistic components, such as

i. Parsers or other tools for input analysis (automatic reformulation of requests and/or "automatic indexing")
ii. Linguistics-inspired design of the retrieval language.

The latter may include such "natural" features as

ii.i word order as an expression of semantic relations
ii.ii modifier/modified distinctions
ii.iii links and roles, "cases", and other relations analogous to the semantic relations in natural language.

But the crucial problem of matching, the retrieval problem, is not treated as a linguistic problem. The underlying assumption is that once the request and offer have been transferred to the exact form which the retrieval language stipulates, they can for retrieval purposes be considered as unambiguous. The rest is a jeu de rencontres.
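The three bases of matching listed above (identity, partial identity, Boolean calculus) can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular system: R' and O' are taken to be plain sets of descriptors, and the descriptor names and function names are invented.

```python
def identical(request, offer):
    """Identity: R' and O' must be exactly the same set of descriptors."""
    return request == offer

def partial_identity(request, offer, minimum_shared=1):
    """Partial identity: at least `minimum_shared` descriptors coincide."""
    return len(request & offer) >= minimum_shared

def boolean_match(offer, all_of=frozenset(), any_of=frozenset(), none_of=frozenset()):
    """Boolean calculus: AND over `all_of`, OR over `any_of`, NOT over `none_of`."""
    return (
        all_of <= offer
        and (not any_of or bool(any_of & offer))
        and not (none_of & offer)
    )

# Hypothetical effective request R' and effective offer O'.
REQUEST = frozenset({"diesel", "engine"})
OFFER = frozenset({"diesel", "engine", "injector"})
```

Here `identical(REQUEST, OFFER)` fails while `partial_identity(REQUEST, OFFER, minimum_shared=2)` succeeds. In all three functions the comparison reduces to asking whether individual descriptors are the same or not, which is the reduction to identity or non-identity of primitives described above.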

[21] With this attitude towards retrieval, the use of linguistics is optional.

The designer may or he may not permit requests and document descriptions to be written in some more or less natural language. But, that is the idea, he may also impose severe restrictions on the questioner and thereby eliminate as much as he chooses of the linguistic aspects and complications. Similarly, the designer may make the retrieval language more or less natural, but, and that is the implicit assumption, he is also free to define this internal logical representation independently.

The need for identity on some level -- the R and O as wholes, parts of R and O, or primitives contained in R and O -- requires a control of the retrieval language. This control is of a destructive kind and might in travesty of Reader's Digest's well-known slogan be summarized under the exhortation Decrease Your Vocabulary (and Your Syntax). The problem of variation in the expressions of R and of O when both mean the same or almost the same thing is met by trying to eliminate that variation. There are two kinds of such linguistic reduction:

i. Elimination of variation of expression where no difference in meaning exists: standardization of usage. Standardization is sufficient only if the system is of order i ("trivial" question-answering).
ii. Elimination of the variation of expression when the difference in meaning is small. The control then means a compulsion to use the closest expression from a permitted set (sometimes called a thesaurus although its prime property is not to be rich but to be poor).

The translation from R to R' and from O to O' then requires an approximation procedure: instead of saying what one means, one says something one does not exactly mean in the hope that the other person who is also not saying exactly what he means will have said exactly the same.

[22] One could summarize the use of vocabulary control and other language control in order ii systems as an attempt to increase the probability of rencontre by reducing precision.
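The trade described in [22] can be shown with a small sketch. The controlled vocabulary, the document names, and the function names below are all invented for illustration; the only assumption about the mechanism is that each free term is forced onto its nearest permitted descriptor before matching.

```python
# Hypothetical controlled vocabulary: free term -> permitted descriptor.
THESAURUS = {
    "lorry": "vehicle",
    "truck": "vehicle",
    "van": "vehicle",
    "barge": "vessel",
    "ferry": "vessel",
}

def translate(terms):
    """R -> R' or O -> O': keep only permitted descriptors (unknown terms are dropped)."""
    return {THESAURUS[term] for term in terms if term in THESAURUS}

def hits(request_terms, offers, controlled):
    """Names of offers sharing at least one (possibly translated) term with the request."""
    request = translate(request_terms) if controlled else set(request_terms)
    found = []
    for name, terms in offers.items():
        offer = translate(terms) if controlled else set(terms)
        if request & offer:
            found.append(name)
    return sorted(found)

REQUEST = {"lorry"}
OFFERS = {
    "doc_truck": {"truck"},
    "doc_van": {"van"},
    "doc_ferry": {"ferry"},
}
```

Without control, `hits(REQUEST, OFFERS, controlled=False)` finds nothing, because "lorry" and "truck" are different strings. With control, the request meets both `doc_truck` and `doc_van`: the rencontre has become more probable, but the system can no longer tell a truck from a van, and that decision was made before any request was seen.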

Precision is here taken in its technical meaning of degree of specification. The result will be, and equally so whether the restrictions apply to R and O or only to R' and O', that certain information will never be used and that the decision never to use it has to be made prior to searching. Whenever such decisions prove to have been premature, the selectivity of the system will be impaired. We conclude: [23] Language control which implies reduced precision of input data is an adequate means of eliminating chronically irrelevant information but is no general solution to the matching problem,

because, and this is a major point, [24]

in the general order ii case, it is impossible to know what information to disregard in a request or an offer until one has seen the partner of the match.

We need to make abstractions but we never know a priori which is the best direction to abstract into. A restricted language, to the extent it is restricted, forces us to make a priori decisions on what to discard. Thus, a description (A, B, C) in a simple system with a finite set of descriptors leaves undecided whether this item is a special case of AB or of AC or of BC; any one of the descriptors may, in a two-out-of-three match, be disregarded. But the system did impose on the indexer the choice of exactly A in place of whatever near-A was originally given.

A Linguistic Approach

The Dream of the Ideal Language. The approach we called non-linguistic was characterized by an effort to replace natural language by a universal exact and unambiguous representation on which a calculus could be defined. Linguists smile sadly at the new proposals for exact logical representations of the meaning of natural language texts. The dream of the ideal language spurred many ambitious attempts over the centuries, since the 17th century, if not earlier. All these great men with their fantastic systems failed, certainly not from lack of time, zeal, or genius. No modern systems engineer should take it as a personal distrust when his linguist friends tell him to give up as a bad job his design for an exact over-all representation, be it a general-purpose classification of all concepts or something else. The linguists react to proposed ideal languages more or less like physicists do when presented with another proposal for a perpetuum mobile; it is not that they would not like to have one. But there is overwhelming empirical and theoretical evidence that a rigid but yet inclusive language will never be designed: [25] A universal linguistic perpetuum immobile is not possible.

Reasonably, then, [26] The design of exact logical representation of knowledge, except for narrowly restricted highly specialized domains, should be encouraged no more than should perpetuum mobile construction.

Hopefully, we need not take any stand on the evasive philosophical issue of whether

[27] The meaning of an utterance in natural language can in principle be specified in terms of a finite set of semantic primitives and well-defined functions thereof.

Personally I think that [27] is exceedingly implausible, but even those linguists who do support it will presumably agree that the semantic representation postulated there is something much more complex and explicit than any representation which could be considered for retrieval purposes.


Exact Meaning Representation Difficult. Special purpose codes can, of course, be invented and have been invented for particular applications. Thus a retrieval system for chemical substances may work satisfactorily if the substances are specified with some chemical formula. But as soon as the retrieval questions expected are permitted to refer to unconventional procedures for chemical procedures or non-listed families of substances, we risk exceeding the scope of such an exact special-purpose language. Even assuming that a sufficiently inclusive exact representation were found for a given purpose, there are other obstacles to the non-linguistic approach: [28] It is inconvenient for humans to write and read in a formal language.

For evidence of this statement, it should suffice to refer the reader to his own bitter experience. Even moderately complex formal systems create enormous amounts of brainpain -- and errors, and that among formally trained persons, too. Even Boolean expressions get out of hand when the levels rise beyond three or four. And professional programmers are haunted by formal errors in programs. This is not a plea for using natural language under all circumstances. Artificial languages could and should be improved. To me, it seems evident that the many millennia of experience built into the structure of natural languages should then be resorted to. The fatal point is that if a language, artificial or not, has enough "natural" features to be attractive to human users, it is likely to become inexact and ambiguous. Thus, one major convenience feature is what might be called redundancy adaptation: a good margin where mistakes are likely to appear and reduced expressions elsewhere. This human-oriented feature necessarily makes the texts at least locally ambiguous. Similarly, the semantic flexibility, which is probably necessary, is liable to produce vagueness; we shall come back to this point.

A less obvious obstacle to inducing humans to produce formalized output is that they often fail even when their performance is formally correct. Human users tend to introduce unintentional "natural" features. Stretching the meaning of exact to something like "having the well-defined meaning which can be derived from the specification of the language, and nothing but that meaning", we could even put it [29] An unambiguous and exact man-made text does not exist.

What I am trying to say is that one may well be cheating oneself into believing that R and O were more formally specified than they really are. The writer and any human reader will still read into the text information which, according to the definitions of the language, is not there, and which will be ignored by the system.*

*If we look at all such permutations of the statements of a Fortran program as are equivalent to the computer, only very few "make sense". On the other hand, quite a few moderately wrong programs do make sense to a human reader, who can correct them without real effort. Similarly, a mathematical proof, say, of Pythagoras' theorem, might still be a valid proof after shuffling some of the lines, but no reader will be able to see the point.


Man-made texts seem to have a text-structure, presuppositional restrictions etc., whatever the a priori norm says about those. A very simple illustration. The Boolean (sub-)expression "A and not B" may well be intended as "A without B". Now the meaning of 'A without B' is quite complex, as appears from the fact that the offer of a document carrying the title diesel engines without injectors is an adequate response to questions such as diesel engines with injectors or even injectors for diesel engines. For whoever wrote about diesel engines "without" injectors presumably either explains why such deprived engines are adequate or makes the point that injectors are not, after all, indispensable. In either case he says something important on the role of injectors in diesel engines. The point is not that a retrieval language cannot do without without, but rather that we cannot be so sure we have eliminated it just because there are no other connectors than AND, OR and NOT in the texts. Naturam furca expurgas...

Some convenience can be gained by using a less formal input language for R and O and translating it into an operational language. Now,

[30] The translation of R and O, if written in a language which has enough of natural features to be attractive to human users, into an exact logical language will cause substantial losses of information.

[31] These translation losses will in general be unpredictable by the user, unless the two languages are very close to each other.
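The diesel engine illustration above is itself a case of the translation loss stated in [30], and it can be replayed in a few lines. The representation of the offer and both function names are assumptions made for this sketch; in particular, the looser `mentions` test is only a device to expose what the strict Boolean reading discards, not a proposed remedy.

```python
# Offer for the title "diesel engines without injectors", indexed as "A and not B".
OFFER = {
    "present": {"diesel engine"},
    "excluded": {"injector"},
}

def boolean_hit(offer, required):
    """Strict Boolean reading: every required descriptor must be asserted by the offer."""
    return required <= offer["present"]

def mentions(offer, required):
    """Looser reading: a descriptor counts whether it is asserted or explicitly excluded."""
    return required <= (offer["present"] | offer["excluded"])

# Request: "diesel engines with injectors".
REQUEST = {"diesel engine", "injector"}
```

Under the strict reading `boolean_hit(OFFER, REQUEST)` is false, so a document that says something important about the role of injectors is never offered. `mentions(OFFER, REQUEST)` is true, which shows that the information was present in the title and was lost only in the translation to an exact representation.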

We may note that exhaustive internal recoding of a given more or less informal text is a far more advanced task than the admittedly difficult (mechanical or manual) translation between natural languages, since in the latter case obscurities may (in fact: should) be transferred unresolved to the target text.

Exact Representation Unnecessary. This may all seem defeatist. A slightly encouraging observation, however, is that there is not necessarily a real need for a complete logical analysis of the requests and offers. We are interested in their mutual agreement rather than in their explication. What we want to do, after all, is not to relate them to a system of general knowledge but to relate them to each other. We need to know how well O approximates R rather than the absolute value of either.


In numerical work, we know exactly what "approximately" means. "Appr. 1.3" may be defined as "not less than 1.25 and less than 1.35" or as 1.3 plus/minus an error which can be well defined, say 1 standard deviation. We need a theory for qualitative approximation, explaining exactly what it means that an offer is approximately what was asked for. In what way is silver an approximation of gold, of money, of chrome, and of photography?

We can compare with the mathematical problem of determining the greatest common divisor of two given numbers. We can achieve this by analysing each number by itself and checking the two results for common primes. A well-known algorithm, suggested by Euclid, produces the common divisor directly, by successive operations (divisions) on the two given numbers or one of them and the result of the previous operation. We need Euclidean algorithms, operating on the pairs of an R and an O and yielding (a measure of) what is common to them. We need to establish not the meaning of one given expression but the similarity of meaning between two given expressions. The fundamental concept, then, is not meaning, but similarity of meaning. We shall use the word homeosemy (from Greek homojos, almost the same) for the similarity of meaning between two expressions.

Exact Representation Insufficient. It is often taken for granted that given exact representations, R' and O', the agreement between these two will be trivial to define (if not to find; long inference chains may present formidable computational problems). Basically, it is assumed that similarity can always be reduced to partial similarity (or to very simple logical relations; cf. supra). We find no real support for such an assumption.

[32] If we want to define a topology over the set of expressions rather than to explicate each of them, we need not assume that the homeosemy of any two expressions or expression components can be further reduced. Rather, our analysis should be built up from the concept of distance or association between primitives ("elementary homeosemy").
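In the spirit of the Euclid analogy, a measure that operates directly on the pair (R, O) and is built up from associations between primitives might be sketched as follows. This is only one conceivable reading of [32]: the association strengths are invented numbers, the averaging rule is an arbitrary choice made for the example, and neither is taken from the cited work on association.

```python
# Hypothetical association strengths between primitives, in [0, 1].
ASSOCIATION = {
    frozenset({"silver", "gold"}): 0.8,
    frozenset({"silver", "money"}): 0.5,
    frozenset({"silver", "photography"}): 0.4,
    frozenset({"gold", "money"}): 0.6,
}

def elementary(a, b):
    """Elementary homeosemy: 1.0 for identical primitives, else the tabulated value."""
    if a == b:
        return 1.0
    return ASSOCIATION.get(frozenset({a, b}), 0.0)

def homeosemy(request, offer):
    """Mean, over request terms, of the best association with any offer term."""
    if not request or not offer:
        return 0.0
    best = [max(elementary(r, o) for o in offer) for r in request]
    return sum(best) / len(best)
```

Under these assumed values, an offer about gold scores higher against a request about silver than an offer about photography does, and an unrelated offer scores zero, without either expression having been explicated on its own. Note that this particular measure is not symmetric in R and O; whether homeosemy should be symmetric is left open here.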

Homeosemy, then, becomes a quantitative concept as fundamental as inference or set inclusion. It is in this perspective that the work on word association should be seen. Consider our attempt at a formal mathematical formulation of association with 'damped transitivity' (Brodda and Karlgren, 1969).

I hesitated whether I should put as sub-heading for this section [33] Exact representation is not desirable.

What I meant was that vagueness may be the price that has to be paid in order to achieve the kind of gliding from one concept to another which is necessary for non-trivial retrieval. We note that in natural languages -- and their design is successful in this respect -- communication normally proceeds without explicit definition of terms. Not only do different persons attach slightly different meanings to the same terms but no person has ever even to himself delimited an exact or definable meaning of terms, except possibly for some few of them.


[34] In normal human communication, introduction of an explicit definition for natural language terms is a symptom of malfunction.

One may ask oneself whether natural language succeeds not in spite of but thanks to the absence of rigid definition of meaning. The flexibility of natural language semantics appears also from the observation that

[35] definitions of terms age much faster than the terms themselves.

Wilkins' 17th century classification of elementary English terms in a systematic conceptual scheme, proceeding from divine, human, animal and so forth, sounds very ancient. But the words themselves remain with approximately the same meaning, and actual texts from the same period can be reasonably well understood today by those who know modern usage. Although we must admit that we find natural language more impressive the more we see of how it works, all this was not mentioned as an argument against artificial languages but as a reason to build into any retrieval language some of this flexibility. For this purpose,

[36] meaning differences between terms and meaning shifts should be studied, particularly the meaning shift between question and answer and the shift due to introduction of new terms in undefined vocabularies.

Here, we can draw on rich funds of philological knowledge.

Effects of Linguistics

So Far. The effects so far have been surprisingly meager, as is made evident by Sparck Jones and Kay (1973) and by later publications. The linguistic designs which have been tested have demonstrated little effect over straightforward non-linguistic methods. In some cases the linguistic ingredient seems to have had a negative effect, even in matters of analysis of natural language input.

[37] In principle, immediate practical tests of economy or over-all selectivity are not adequate for evaluating new methodology.

Otherwise a polished primitive system will almost always win over innovations which are less ripe for production. Practical mileage economy tests may be adequate for a Ford and a Volvo but are uninteresting in a comparison of a Volvo and a prototype electric car. Nevertheless, when some documentalists maintain that mere recording of (truncated) terms without respect to word order or any other syntactical information performs better than linguistic analysis -- and when this happens to be true, with some qualifications -- it is a challenge.


One reason for poor success is that

[38] a kind of linguistic analysis which long ago (in some cases: centuries ago) had been given up within linguistics itself survives or is reborn in documentalist applications.

Thus, models equivalent to naive dependency or phrase-structure grammar have been allowed to represent linguistics as a science.

These grammars, which are nevertheless in some way elementary and fundamental, cannot be used even as a first approximation to grammatical analysis. They may be good as grammar components, but a component cannot replace or approximate the whole.

Thus, an innocent analysis of 'Electronical pedagogical equipment in nursery schools' and of 'Nursery schools with pedagogical equipment' will over-emphasize the differences. It will find that different things are referred to -- gadgets and schools -- and different things are stated about them: where they are kept and how they are equipped. That kind of analysis will fail to see that two differences cancel out. The grammatical filter will remove too much in such cases but yet not disclose the similarity of such pairs as 'Finland's export to Sweden' and 'Sweden's import from Finland'.

Today. Linguistics could immediately be helpful

i. by dissuading documentalists from spending resources on vain attempts, such as
i.i creating an ideal language,
i.ii attacking the general retrieval problem by means of language reduction,
i.iii trying to make their retrieval language more stable (by means of term definitions or otherwise) instead of more flexible;

ii. by supplying
ii.i professional tools for parsing and similar tasks,
ii.ii means for synonymy manipulation,
ii.iii quantitative association methods for study of associative structures,
ii.iv ad hoc methods for grammatical filtering.

Since full-fledged analysis is probably not practical, algorithms must be designed on the basis of the deeper insights about, say, transformational relationships. Computational linguistics has lost its innocence and must draw some conclusions therefrom. One conclusion is that since an expression in natural language can under various assumptions yield so many reasonable interpretations, the procedure must take the uncertainty of any analysis into account.
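The kind of relationship at issue can be illustrated with the pair 'Finland's export to Sweden' and 'Sweden's import from Finland' cited earlier. The sketch below is a toy under stated assumptions, not a parser and not a method proposed in this paper: the phrases are analysed by hand into four slots, and the `Phrase` type, the slot names and the single export/import rule are hypothetical. It only shows that a slot-by-slot comparison finds nothing in common, while one rewriting rule for the converse relation makes the two phrases coincide.

```python
from typing import NamedTuple, Tuple


class Phrase(NamedTuple):
    possessor: str    # "Finland" in "Finland's export to Sweden"
    head: str         # "export"
    preposition: str  # "to"
    complement: str   # "Sweden"


def slotwise_agreement(p: Phrase, q: Phrase) -> int:
    """Naive comparison: count the slots holding identical words."""
    return sum(x == y for x, y in zip(p, q))


def normalise(p: Phrase) -> Tuple[str, str, str]:
    """Rewrite a phrase as (relation, origin, destination).

    Hypothetical rule: 'X's import from Y' is the converse of
    'Y's export to X'. Only these two patterns are handled.
    """
    if p.head == "export" and p.preposition == "to":
        return ("trade", p.possessor, p.complement)
    if p.head == "import" and p.preposition == "from":
        return ("trade", p.complement, p.possessor)
    raise ValueError(f"no rule for {p!r}")


a = Phrase("Finland", "export", "to", "Sweden")
b = Phrase("Sweden", "import", "from", "Finland")
c = Phrase("Sweden", "export", "to", "Finland")

print(slotwise_agreement(a, b))      # no slot agrees
print(normalise(a) == normalise(b))  # same trade flow
print(normalise(a) == normalise(c))  # opposite trade flow
```

A realistic procedure would also have to weigh competing analyses of the same phrase, in line with the point about uncertainty; the sketch leaves that out.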


In the Future. Linguistics must be involved in the retrieval problems as such. These are linguistic by nature; there is no choice whether or not to treat them in a linguistic manner.

Effects on Linguistics

More focus will be placed on

i. exact study of inexact expressions
ii. study of shifts of meaning
iii. study of question-answering
iv. semantic topology.

So far, emphasis has been placed on the binary distinction between the same and not the same meaning. Of old, linguists have been keen on finding distinctions which were otherwise overlooked. Lately, linguists have established equivalence classes of expressions which have exactly the same meaning (feeling very unhappy, some of them, when these 'variants' turn out to differ after all, at least in theme/rheme relations). Systematic study must be made of the agreement between such expressions as cannot be treated as semantically equivalent.

References

Brodda, B., and Karlgren, H. "Synonyms and Synonyms of Synonyms". SMIL, 1969, 5, 317.
Sparck Jones, K., and Kay, M. Linguistics and Information Science. New York, Academic Press, 1973.
