VI-I VI. T e m p l a t e A n a l y s i s and its A p p l i c a t i o n to Natural Language Processing S tephen F. Weis s Abs trac t The basic first section of and this paper the introduces basics the of prethree types of templates theoretical template sents analysis. Building on t h i s , the second section some practical considerations analysis needed for the implementathe of tion of a general concept template uses of template s c h e m e , including template keywords. An actual in Section implementation 3 . of The analysis is presented for scheme template analysis the extraction text. Basic bibliographic phrases from natural language implementation presented. techniques as well as experimental results are 1. The Basics A) of Template Analysis Introduction analysis is a technique for partial semantic Template analysis natural of of natural language language. string It consists against of matching a set words in input a predetermined of one or more will become templates. A template is a string or special symbols. that Its exact meaning If a match clear a the examples template and follow. occurs between complete some part of the input, a specified set of VI-2 actions change data. single is p e r f o r m e d . the input string These may involve direct effects which or side effects which work on terminate after the discovery other of a it con- The process may template match as in ELIZA [ 3 ] , but more often in the input have performed. tinues until all located and Each template m a t c h e s actions been the associated template is a string, usually made up of very the framework of a natural little language, meaning. words which common w o r d s , which phrase. They may fit The forms template w o r d s generally convey d o , h o w e v e r , create a context fit and into the thus form a m e a n i n g f u l template and bear into which other phrase. These words of the crux of the meaning environment. the phrase are called the template of phrase template the template The words type ir thus indicate the existence of a specific The as well as qualifying gives the e n v i r o n m e n t . to that environment specific meaning phrase. First, gives T e m p l a t e analysis the input is basically with a two step p r o c e s s . set. is compared the template A match an i n d i c a t i o n Second, that a certain type of phrase is p r e s e n t . of the of input the ex- the environment This either of the matched provides segment is c h e c k e d . the exact meaning phrase or rejects amples serve the original a n a l y s i s . the meaning template: The following to clarify the of templates and template analysis. Consider VI-3 BETWEEN The zeros indicate 0_ AND () one and only one word may " h o l e s " where occur. This template matches phrases A, B and the holes C in Figure 1 the wrong It will not m a t c h D or E because number of w o r d s . contain A. B. C. D. E. BETWEEN A AND Z BETWEEN HERE AND THERE BETWEEN 1950 AND 1965 BETWEEN AND BEYOND BETWEEN LAST YEAR AND NOW Sample input phrases Figure 1 A match of this template with an input To verify string this provides it remains an only in indication of a date p h r a s e . if the words to d e t e r m i n e this case If such in the template environment, the two words in the h o l e s , are date in C, appropriate action indicators. is taken in is the case as such as passing A and The the dates to a search program. f a i l s , the phrase If as is B the environment check check rejected. by the environment can sometimes template. template: be eliminated the use of a more version of complete Consider modified the previous BETWEEN 19//// AND 19## VI-4 The pound the inputs further signs indicate the presence of any d i g i t . only C. Of No and in Figure of 1, this template m a t c h e s is needed testing the environment can be taken in this case appropriate actions immediately. B) Types of templates types of t e m p l a t e s . required They are There are distinguished in the three basic by the processing template and results achieved two step 1. analysis Under all process. certain circumstances the information the to Certainty (CI). template match determine phase provides required test the is the meaning of a p h r a s e . The environment is thus not n e e d e d . second of An example of such a template examples: the above BETWEEN 19//// AND 19//// The template is complete w i t h i n itself and the environment in usable to test may be o m i t t e d . analysis is expected The While a high degree of accuracy from them, CI templates of templates are not required in p r a c t i c e . provide with large number any degree of generality task of creating in allowable i n p u t s , along CI the huge the template l i s t , makes templates 2. of impractical for all but a few c a s e s . (C2-C1). by 0 the The second following: type Conjecture-Certainty t e m p l a t e , the C 2 - C 1 , is typified 0 AND BETWEEN VI-5 where the zeros match any single word. In this case a type of template m a t c h phrase. This 1. is not a guarantee example of a particular matches in particular thus only inputs A, B, and the to existence C in Figure The system phrase. conjectures is subject of a specific rejection This decision possible template depending on further examination of the of environment. Environment words testing may consist determining if the environment Or it may table. produces consist are of a specified environment type or t y p e s . of testing words against a match by In all cases of C2-C1 a c o n j e c t u r e which test, hence final t e m p l a t e s , the template is then completely the name resolved the environment Because nearly of this "conjecture-certainty". templates provide certainty, the C2-C1 the same high degree of accuracy The analysis complex as do the CI matches, needed C2-C1 be is greatly phrase templates. however, of the C2-C1 template due to the extra is more testing of the to p r o v i d e templates specified. eased type small and cortainty. is that only The job the number The prime advantage the frame of a phrase need the template required set of determing of templates for each is significantly system reduced. for As an e x a m p l e , in study, journal the phrase and a alone implemented this detection journal is accomplished using 18 C2-C1 templates list. To do the same job with CI templates VI-6 requires Savings 450 t e m p l a t e s , each in search effort longer than the present 18. t i m e , storage are space as well as template preparation 3. template only ture ment case substantial. (C2-C2). The third type of of Conjecture-Conjecture similar is quite to the C2-C1 in that yields it consists the frame of a p h r a s e , and a match that a phrase test exists. after Also only a c o n j e c environin the is a seC2-C2 this like the C 2 - C 1 , an is found. is performed the match But the environment conjecture. probably test merely The result weakens is not or supports c e r t a i n , but the original cond, and b e t t e r , conjecture and hence of the designation. It is apparent that because these uncertainty cannot be in the final r e s u l t , analyses with expected The C2-C2 to be as accurate templates as those of the CI or C2-C1 whenever the types. templates are necessitated too large environment use is the for a phrase becomes of a look-up or varied testing for practical rules. Such table or environment case of author of all author as : phrases w h e r e a list of all possible names is prohibitively large. forms such A template WRITTEN BY may give an indication of a possible author p h r a s e ; and (see is there exist 3, part certain rules which may help C). But the final decision the decision of phrase Section still type VI-V only an educated guess. thus T h i s , h o w e v e r , is better the C2-C2 templates than no guess at a l l , and serve a useful purpose when no other method From creation and reason study the standpoint u s e , the C2-C1 is a p p l i c a b l e . of generality templates and ease of For this the three this seem b e s t . used in the vast majority are of this type. of the templates Figure 2 below summarizes types of templates. Template type CI C2-C1 C2-C2 Decision template c ertain after match Decision after environment test no t needed certain improved conjecture conj ec ture conj ec ture Summary of template types Figure 2 C) Applicability of template analysis requires an a priori . that is tem- The use of template analysis knowledge of every type of natural is specified language input to be a n a l y z e d . plates and themselves This in the form of the as well as e n v i r o n m e n t a l If the input form testing rules or used. look-up tables. is quite varied cannot be unpredictable, In such and then clearly template and analysis cases more general analysis complex forms of syntactic some semantic are required. But there are VI-8 applications for template in w h i c h analysis. While the input structure meets the criteria from One such a p p l i c a t i o n and syntactic is input a console. that the semantic structures without called his the average person the language can understand are almost uses, bound, that a person actually ±s a performance grammar, language. small subset of his understandable when extra This p e r f o r m a n c e type rather required tend set is further restricted The the user must time and sentence than speak his input. to compose and type effort complicated When the user structure to discourage is added such a c t i o n s . each input the further is limited become restriction to a specific that for set of input t y p e s , the inputs use of the sufficiently predictable to permit effective template analysis. uses of tat ion. The remainder and of this paper explores template analysis describes a working implemen- 2. An I m p l e m e n t a t i o n Analysis The of Natural Language A n a l y s i s by Template theoretical basis of template analysis has section. The present used in the section been ex- presented plores in the previous the practical considerations analysis implementation of an e x p e r i m e n t a l template system. VI-(J A) Keyword analysis string against a template set can of Searching often be a very templates an input inefficient process. input The vast majority s t r i n g , and do not have templates of do not match a given these nonmatching word t e m p l a t e s , the majority the input. even a single thus have it would matching templates by one in common with These no p o s s i b l e chance of matching to exclude the given i n p u t , and be a d v a n t a g e o u s phrase. For them from the template of keyword this reason This the technique is introduced. Weizenbaum of is similar [3]. The to the scheme used in ELIZA in each technique entails assigning As the the words template into to be the keyword. those with all the templates keywords have been tains of The are entered are stored the system, After identical in a g r o u p . list templates The list con- entered, a keyword for last is created. the keyword and each g r o u p , and template for the storage location group. of the in group the first that particular template matching list. process begins with a search keyword keyword the If the particular is contained template i n p u t , then each member of the associated is searched contain is for a possible m a t c h . keyword, If the input does not set the particular the associated scan of the template ignored. group Thus with a single of obviously i n p u t , an is eliminated entire nonmatching templates VI-10 from the set which must be searched. templates is quite The savings in search savings time by keyword are word word other further substantial. is the least These likely enhanced if the keyword "Least in the t e m p l a t e . in the likely w o r d " least is defined in as that template w h i c h of occurs often contexts the than that the given template p h r a s e . Consider template: DURING O i TO < 0 Clearly the use of "DURING" as the keyword attempts than would leads to fewer Thus, u n s u c c e s s f u l match by eliminating the use of " T O " . for a those templates which have no chance template match w i t h a given streamlines process. The keyword function; Without decision input, the keyword and fairly facility greatly the complex lengthy template matching template technique has a second useful scheme. it permits a middle-out template matching algorithms solution k e y w o r d s , template matching of where to s t a r t . require a is to try of the is The only every word template. tested long in the input as a possible If a match first word of o c c u r s , the next word template w o r d , and the input For against the second so on. input strings this can be a time consuming templates process. This is especially true for nonmatching since rejection VI-11 cannot occur until scheme every input word those has been tested. Since have keyword) the keyword tests only templates which at least one word this common word search. For in common with provides the input (i.e., the the of key an ideal place to start inputs with two or more occurrences as the principal to the left of the the and same k e y w o r d , one is designated searching in the begins there. Words keyword of template are tested against words likewise to the left the keyword hence this in the i n p u t , and for the r i g h t ; and in the m i d d l e - o u t w a r d study have strategy. Most templates used the first in one or their keyword as either last element, only. so that most searching is actually this direction is that But the important seed, that thing about technique for the the search is uniquely nonmatches than after advantage taining is the starting point search, of determined. to occur after exhaustion This also permits only a very few rejection tests rather A further con- of the entire process input string. of the keyword involves the inputs a given the more than one substring which matches searches of the of template. left A normal matching process the first input to right and picks out substrings. designer T h i s , h o w e v e r , may not be the intent who in fact may want the system the second, or possible rightmost,, s u b - VI-12 p h r a s e analyzed first. The solution is easy using keywords. con- Inputs with m u l t i p l e m a t c h e s tain m u l t i p l e By simply occurrences of the same template must for that of the keyword keyword template. principal designating the desired as the key, the proper Thus substring is analyzed first. in template matching which the keyword technique helps on two l e v e l s . must be tested a specific F i r s t , it narrows against the set of templates second it the i n p u t , and provides for the start point in both the template and the matching input of the s e a r c h . faster and m o r e The similar This permits process to run e f f i c i e n t l y , as required for on-line u s e . in a manner list is template matching process proceeds [13]. to that of a Markov algorithm The template the input out. The list. is scanned found. process in top down order until a match with actions are then carried The associated is then restarted at the top of the template removal from the input repeated Of course of matched of that this n e c e s s i t a t e s substrings string matching the type in the in order to prevent substring. This is generally done by replacing to the substring with a n o n t e r m i n a l symbol a p p r o p r i a t e of template m a t c h e d . subsequent end This n o n t e r m i n a l may The process then be used template m a t c h e s . terminates when finding of the template list is reached without a match. VI-13 The a d v a n t a g e specific of the Markov approach and is that ordering it enables a of ana lysis a desired to hierarchy of templates be imposed template in the system, as well as allowing to be maintained. grouping B) The Implementation templates shown conventions in the previous examples are While practical require space are quite m a d e up of words and these are easy for an actual large amounts needed complex are represented and alphabetically. they are not templates amount of to read understand, implementation. Alphabetic of storage with a v a r i a b l e Second, alphabetic And for each w o r d . and searches time consuming. it necessary individually. third, the use of a l p h a templates be betics makes represented all have that For even synonymous example the templates actions. below the same meaning and associated WRITTEN AUTHORED BY BY BY PRESENTED The use of alphabetic be in the template templates would From actual in fact many template require that all it three set. experimentation such synonymous appears and that there are large templates, thus a very set can r e s u l t . Clearly a VI-14 shorthand problems template r e p r e s e n t a t i o n which overcomes For this reason numerical The conversion is achieved these and is desired. templates to its input v e c t o r s are used. associated template composed, numerical from a word through a concept special is is dictionary. for The dictionary is quite small and type w o r d s , that the most p a r t , of functor conjunctions, prepositions Words not "999" found and a few other very are assigned common w o r d s . the concept used 22 in the dictionary that to indicate they are u n k n o w n . contains The dictionary into in this concept implementation numbers. 140 words which map is The problem of synonymous words the same concept the same word solved by assigning The numeric the natural natural from to the desired words number. as input vector m a i n t a i n s language input. order The example below shows the numeric vector the treatment that two results and language inputs and Note either of them. of synonyms unknowns. Input A: Input B: PAPERS WRITTEN BY LESK ON BY WEISS CLUSTERING CONCERNING A R T I C L E S AUTHORED TEMPLATES Vector: 27 15 Concept 12 999 999 999 Sample natural language and numeric input strings Figure 3 VI-15 Templates addition following templates are also stored in numeric form with in the the of some special paragraphs. and stored conventions explained of this In the remainder input strings study to be all numeric. are assumed Occasionally, presented h o w e v e r , some alphabetic of clarity. examples may be for purposes The use of clearly The numeric solves the representation previously reduction number of inputs and synonymy templates problem. mentioned resultant the 44 in the size of the template set varies with of synonymous w o r d s , in the present are required instead of more than implementation two thousand. easier. amount and templates Numeric templates also make storage and searching U n l i k e an alphabetic of s t o r a g e . faster string, a number of numbers of variable uses a fixed easier Comparison is also than comparison length alphabetic provides and a more more strings. compact In general data the use of numeric and vectors representation facilitates faster efficient processing. numerically in a similar into Templates are stored manner to that of input in alphabetic strings. form and They may be entered the system run through a dictionary in numeric form. The form 4. 5, 6 l o o k - u p , or they may latter is used in this is given be entered implementation. The general of a template The reason for in the BNF definition of the digits in Figure in lines the d i s t i n c t i o n VI-16 and 7 will b e c o m e clear in what follows. 1. 2. 3. 4. T e m p l a t e ::= part>