4 Tables and dictionaries

4.1 Introduction

Automatic cross-reference tables can be used in both spelling correction and recall improvement. They are particularly useful for automatically linking synonyms, abbreviations, alternative spellings and other related terms to their equivalents.

4.2 Methods and techniques

4.2.1 Dictionaries in spelling correction

Pollock and Zamora [1] used a dictionary for spelling error detection in the SPEEDCOP project, although this technique is only one of several. SPEEDCOP uses a dictionary of common misspellings as a supplement to a similarity key/reverse error algorithm. The common misspelling approach is based on the assumption that spelling errors which have occurred in the past will occur again. The incorrect form can be mapped to the correct spelling (for example, "teh" might be mapped to "the").

There are several limitations inherent in this approach. Firstly, most spelling errors do not recur at all frequently. Secondly, the list is always incomplete as new misspellings will constantly occur. Thirdly, up to 15% of misspellings are ambiguous (for example "hoise" could be a miskeying of "noise", "house" or "hoist").

Experiments conducted by the SPEEDCOP team indicate that the effectiveness of a dictionary of common misspellings depends on the size of the sample from which it is created; a dictionary generated from a sample of about ten million text words will probably correct about 10% of misspellings. Although its application is limited it is almost entirely accurate provided that it only contains unambiguous misspellings.

The SPEEDCOP project also used a general dictionary of around 40,000 correct terms. This is generally adequate even for a technical database as most of the text of a technical database does not consist of specialised vocabulary (Pollock and Zamora point out that 30% of Chemical Abstracts consists of function words but only 2% of chemical substance words).
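
The combination described above, a general dictionary of correct terms supplemented by a table of unambiguous common misspellings, can be sketched roughly as follows. This is an illustration only and not SPEEDCOP's actual code: the function names, the sample word lists and the three result labels are all assumptions made for the example.

```python
from collections import defaultdict


def build_misspelling_table(observed_pairs):
    """Build a misspelling -> correction table from (wrong, right) pairs.

    Misspellings seen with more than one correction are ambiguous
    (e.g. "hoise" for "noise", "house" or "hoist") and are left out,
    so that every remaining entry can be corrected automatically.
    """
    candidates = defaultdict(set)
    for wrong, right in observed_pairs:
        candidates[wrong.lower()].add(right.lower())
    return {
        wrong: next(iter(rights))
        for wrong, rights in candidates.items()
        if len(rights) == 1
    }


def check_word(word, dictionary, misspellings):
    """Classify one text word as "ok", "corrected" or "flagged"."""
    key = word.lower()
    if key in dictionary:
        return "ok", key
    if key in misspellings:
        return "corrected", misspellings[key]
    # Unknown to both tables: leave for another method or manual checking.
    return "flagged", key


# Hypothetical sample data, invented for this sketch.
GENERAL_DICTIONARY = {"the", "noise", "house", "hoist"}
OBSERVED_ERRORS = [
    ("teh", "the"),
    ("hoise", "noise"),
    ("hoise", "house"),
    ("hoise", "hoist"),
]
MISSPELLINGS = build_misspelling_table(OBSERVED_ERRORS)
```

With this sample data, "teh" is corrected to "the", while "hoise" is only flagged because its correction is ambiguous. Words that are flagged would be passed on to a second technique, such as the similarity key approach mentioned above.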

Dictionary look-up does have limitations; a comprehensive dictionary can recognise a large proportion of the text but the number of words unmatched will still be large when the volume of words processed is in millions. The situation is not improved simply by increasing the size of the dictionary. The proportion of words recognised will certainly increase but some misspellings are likely to be identified as "correct". For example, "ion" will probably be a high frequency word in a chemical database, but is more likely to be a miskeying of "icon" in a decorative arts database. "Ion" should not be included in a decorative arts dictionary as it is more likely to be wrong than right.

A small special dictionary of specialised vocabulary can be used to supplement the standard dictionary of common words. In most applications this specialised vocabulary will be disproportionately represented in the flagged words. Pollock and Zamora have written that when the Chemical Abstracts dictionary was applied to Chemical Industry Notes (a more specialised database dealing with the chemical industry) more than 98% of the words flagged as possible misspellings were valid. Most of these words were eliminated by a small specialised dictionary.

The SPEEDCOP procedures include a suffix normalisation technique similar to that used by Galli and Yamada [2], which serves to increase the capacity of the dictionary for matching terms without increasing its size. If a word is not in the dictionary then the suffix algorithm would stem the word to its root and look for this in the dictionary. Although stemming can reduce the number of terms in the dictionary (by 15% in this instance) this saving in storage is offset by the computer time taken to identify the variants.

The SPEEDCOP algorithm attempts to bypass specialised classes of words, such as acronyms, trivial and trade terms for substances, systematic chemical nomenclature or proper names.
Words which fall into these categories can often be treated by incorporating a document-level frequency threshold. More specifically, acronyms and systematic nomenclature can be recognised algorithmically. Acronyms can be detected with reasonable reliability if they are in upper-case letters.

4.2.2 Proof-reading methods

Galli and Yamada [2] describe an automatic dictionary which has been used for checking machine readable text in proofreading. The dictionary verifies every text word and produces an output document in which all the words are hyphenated if necessary, corrected if misspelt and standardised in the case of spelling variations. British spellings are transformed into American spellings and nonpreferred spellings are transformed into preferred forms.

The dictionary contains about 56,000 entries including word stems, word endings, whole words, prefix-combining forms, suffix-combining forms, spelling standardisation entries, spelling-error correction entries and control entries. The size of the dictionary was reduced by including the prefix and stem information. Unlike the uses of stemming discussed in Chapter 3, stemming is here used as a compaction technique and not as a recall improvement device. The dictionary is used to identify compound words; some endings are listed with a code which allows tentative compounding and then invokes corrective measures at a later stage.

The dictionary contains a list of almost 2000 words which can be spelt in different ways. In this way, a nonpreferred spelling is transformed into the preferred form ("moveable" is altered to "movable"). The system also transforms British English into American English. Correct spellings of archaic words which could be misspellings of more common words are not altered but are flagged (Galli and Yamada give the example of "calender", the archaic spelling of "calendar"). Equally acceptable variants are both listed ("sirup" and "syrup").
The system lists about 2000 common misspellings which are mapped to the correct form (and flagged as corrected for possible manual checking later). A test of the system verified 89% of the documents.

4.2.3 Spelling correction using a dictionary together with a word representation technique

The designers of LEXICON [3, 4] suggest that a good error correction method would be a two-step process consisting of a moderately low threshold modified soundex system followed by a high threshold similarity check. Tests revealed that this combination should automatically correct 60-70% of errors.

The medical free text system at Massachusetts General Hospital [5] uses a dictionary as a pre-stage to soundex spelling correction. The dictionary includes some variant spellings, terms which are (medically) non-preferred (such as "womb") and expletives. The latter are ignored by the computer - they are in effect on a stop list - in order to discourage their use.

4.2.4 Using tables to match related words

Possibly the greatest utility of tables lies in their matching potential for semantically related words, as stemming and word representation devices are only useful for words which are morphologically similar. Some words which are similar in meaning are also orthographically similar, but many are not. Even when they are orthographically related, a dictionary link will ensure a unique match whereas a link through a stemming algorithm might include other irrelevant words.

Medical information retrieval systems seem particularly well-suited to the application of these techniques. Wong and others [4] have described a natural language dictionary (LEXICON) of anatomic pathology. Text is scanned and separated into keyword types: authoritative types, nonpreferred keywords and optional synonyms. LEXICON is not organised as a hierarchical thesaurus with complicated cross-linking. More than half of the words (52.2%) are cross-referenced to a supplementary word.
These supplementary words fall into several categories:

    a preferred synonym or alternative spelling ("edema" for "oedema");
    an optional synonym ("jaundice" for "icterus");
    a disinflection ("kidney" for "kidneys");
    a related word in a hierarchical scale ("enteritis" for "enterocolitis");
    a medical term for a lay term ("pregnancy" for "gestation").

Doszkocs [6] describes AID, an "Associative Interactive Dictionary", for automatically generating and displaying related terms, synonyms, and broader/narrower terms.

CITE (2.5) uses a table look-up procedure. This table, for example, maps "treat", "treatment", "treating", "therapy", "therapies", "therapeutic", "care" and "regimen" to the subheading "therapy" [7]. Other suggestions include the automatic mapping of author entries to the National Library of Medicine name authority file and the use of an automatically generated thesaurus.

4.2.5 Linking natural language terms with controlled language terms

Tables have been used in medical information systems as a means of linking natural language search terms to controlled language MeSH headings. Doszkocs has written that this dependency on MeSH is particularly pronounced in Medline since less than 50% of the records contain abstracts [8]. If a retrieval system is to work effectively, then it is essential to develop links from the search terms to the appropriate subject headings. This can be done in several ways; one of the simplest ways is to use a dictionary for automatically mapping search terms to potentially useful subject headings. In some disciplines this dictionary would need to be constructed manually. In the "hard" sciences there are often existing thesauri. Doszkocs considered identifying chemical synonyms by matching query terms against the National Library of Medicine's chemical dictionary file, CHEMLINE.
Although Doszkocs is as yet uncertain of the best method of attempting synonym control, the current CITE does achieve a considerable degree of semantic expansion by using tables to map search terms to the MeSH file.

4.2.6 Compound words and homographs

The MORPHS system [9, 10, 11] developed by Bell and Jones at the Malaysian Rubber Producers' Research Association uses lists in its treatment of compound words and homographs.

The presence of compound words in textual information can substantially influence recall. A compound word or a phrase which has the same meaning can exist in several different forms: for example, "houseboat", "house boat", "house-boat", or even "a house which is a boat". These all have the same meaning but if the system makes no attempt to deal with this problem, the different forms will not be brought together in the index. The only totally automated method of separating compound words would necessitate checking all multi-syllable words against the system vocabulary in order to ensure that they do not contain embedded fracturable elements. A list would still be needed to detect compound words which are neither concatenated nor hyphenated.

Bell and Jones discuss the use of what they call links as a solution (albeit partial) to the compound word problem. Using links, the compound "fountain pen" could be entered into an inverted index with a P-link added to the term "fountain" and an F-link added to the term "pen". A system of links has been incorporated into a recent version of MORPHS.

MORPHS also holds a list of homographs, such as "china". A user whose search contains "china" might be asked to choose between "ceramics" and "People's Republic".

4.2.7 Stop lists

Practically all information retrieval systems use stop lists of words which do not enter the index or which are automatically removed from searches submitted to the system. Stop lists range in size from a handful of words to many hundreds.
Different lists may be used for different data fields or search types.

4.3 The use of tables in online catalogues

Several commercially available systems (see 2.4.2) allow user libraries to set up lists of groups of terms which will be treated as equivalent during indexing and searching. MELVYL uses an abbreviation table [12]. CITE's mapping of users' terms to MeSH headings has been mentioned above (4.2.5). Some of the phrase search systems can use authority files to map searches semi-automatically to preferred forms.

All the more recent systems allow the use of stop lists, although there may still be one or two "phrase searching" catalogues where initial definite and indefinite articles are not automatically removed from user input.

An intermediate version of Okapi (1.5) contained a single table which was used both during indexing and during searching. Entries in this table were of three types:

    stop words and phrases
    lists of terms to be treated as equivalent (child, children)
    compound words or "go phrases" (industrial revolution)

It will be seen in Chapter 6 that the Okapi '86 EXP system uses more or less the same scheme. The most obvious difference in Okapi '86 is that everything (apart from some stop word processing) is done within the index to avoid the storage overhead of a large table when the search programs are running.

References

1 POLLOCK J J and ZAMORA A. System design for detection and correction of spelling errors in scientific and scholarly text. Journal of the American Society for Information Science 35 (2), 1984, 104-109.
2 GALLI E J and YAMADA H. An automatic dictionary and the verification of machine-readable text. IBM Systems Journal 6 (3), 1967, 192-207.
3 JOSEPH D M and WONG R L. Correction of misspellings and typographical errors in a free-text medical English information storage and retrieval system. Methods of Information in Medicine 18 (4), 1979, 228-234.
4 WONG R L and others. Profile of a dictionary compiled from scanning over one million words of surgical pathology narrative text. Computers and Biomedical Research 13, 1980, 382-398.
5 FENICHEL R R and BARNETT G O. An application-independent subsystem for free-text analysis. Computers and Biomedical Research 9, 1976, 159-167.
6 DOSZKOCS T E. AID : an Associative Interactive Dictionary for online searching. Online Review 2 (2), 1978, 163-173.
7 DOSZKOCS T E. CITE NLM : natural-language searching in an online catalog. Information Technology and Libraries 2 (4), December 1983, 364-380.
8 DOSZKOCS T E. From research to application : the CITE natural language information retrieval system. In: Research and Development in Information Retrieval. Proceedings, Berlin 1982. Edited by Gerard Salton and Hans-Jochen Schneider. Berlin : Springer-Verlag, 1983, 251-262.
9 BELL C L M and JONES K P. A minicomputer retrieval system with automatic root finding and roling facilities. Program 10 (1), Jan 1976, 14-27.
10 BELL C L M and JONES K P. Back-of-the-book indexing : a case for the application of artificial intelligence. In: Informatics 5. The analysis of meaning. Proceedings of an Aslib/BCS Conference. Oxford, 1979. London : Aslib, 1979. 155-61.
11 BELL C L M and JONES K P. The development of a highly interactive searching technique for MORPHS (Minicomputer Operated Retrieval (Partially Heuristic) System). Information Processing and Management 16, 1980, 37-47.
12 UNIVERSITY OF CALIFORNIA. OFFICE OF LIBRARY AUTOMATION. MELVYL reference manual : University of California Online Catalog. Berkeley : University of California, 1985.