AD 738 506

ANNUAL REPORT: AUTOMATIC INFORMATIVE ABSTRACTING AND EXTRACTING

L. L. Earl, et al

Lockheed Missiles and Space Company
Palo Alto, California

February 1972

DISTRIBUTED BY:
National Technical Information Service
U. S. DEPARTMENT OF COMMERCE
5285 Port Royal Road, Springfield Va. 22151

This document has been approved for public release and sale.

LMSC-D246461

Annual Progress Report
Office of Naval Research
Contract N00014-70-C-0239

Reproduction in whole or in part is permitted for any purpose of the United States Government

Information Sciences
Lockheed Palo Alto Research Laboratory
LOCKHEED MISSILES & SPACE COMPANY, INC.
A Subsidiary of Lockheed Aircraft Corporation
Palo Alto, California 94304

DISTRIBUTION STATEMENT A
Approved for public release; Distribution Unlimited

PRECIS
RESEARCH PROGRESS REPORT

Title: "Annual Report: Automatic Informative Abstracting and Extracting," Annual Progress Report, Office of Naval Research, Contract N00014-70-C-0239.

Authors: L. L. Earl, O. Firschein, and M. A. Fischler

Background: This investigation is concerned with the development of automatic indexing, abstracting, and extracting systems. Basic investigations in English morphology, phonetics, and syntax have been pursued as necessary means to this end. Experimental indexing and extracting systems have been developed. At this time, investigation of the use of syntax in indexing and of descriptive representation of pictorial data is continuing.
Condensed Report Contents: Part I of this report describes a continuing effort in the development of tools for making syntactic and semantic distinctions of potential use in automatic indexing and extracting. One of the tools is a program for syntactic analysis of English; the other is a dictionary of English word government patterns. A multilevel parser PHRASE is described, in which the syntactic structure is built up in four stages, with ambiguities at each stage resolved to yield but one structure for a given sentence. The resultant structure at each level is designed to be useful in its own right, and also to form the basis for the analyses at the next higher level. The nature of the rules for identifying structures and resolving ambiguities is discussed for all four levels of analysis, and examples of the level 1 and 2 analyses, which have been implemented, are given. The nature of word government is discussed and also its usefulness in making semantic and syntactic distinctions. Appendixes give government tables compiled in the last few years.

Part II of the report deals with the three main problems that arise in the storage and retrieval of picture descriptions: (1) the acquisition of meaning from natural language descriptions, (2) the symbolic representation of the meaning, and (3) the organization of the data base of descriptions to allow efficient retrieval of descriptions in response to queries. In the natural language area, the problem of ambiguity is discussed, and a sample parsing of descriptions using the PHRASE parser is presented. The conceptual classes and picture primitives to be used in representing the meaning of descriptions are treated in some detail, concentrating on natural language expressions for "location." In the data base organization area, it is noted that the complexity of description can be reduced by providing the system with "world knowledge" concerning relationships, and a two-dimensional map is suggested for this purpose.

For Further Information: The complete report is available in the major Navy technical libraries and can be obtained from the Defense Documentation Center. A few copies are available for distribution by the authors.

LOCKHEED PALO ALTO RESEARCH LABORATORY
LOCKHEED MISSILES & SPACE COMPANY
A GROUP DIVISION OF LOCKHEED AIRCRAFT CORPORATION

FOREWORD

This report marks the completion of the eighth year in which the Office of Naval Research has contributed support to research in the Information Sciences at the Lockheed Palo Alto Research Laboratory of the Lockheed Missiles & Space Company, Inc.

During the first year of the program, a major part of the effort went into establishment of a word-data base. The English Word Speculum, which has been distributed to ONR program participants, illustrates the nature of this data base. In the second and third years, this data base was exploited in the development of a computer program for the automatic assignment of parts of speech to English words. Also during these years, it was demonstrated how an English/Russian phrase data base can be used to develop a technique for obtaining English indexes from untranslated Russian text. In the third and fourth years, a new data base of sentences with assigned parts of speech was created for investigation of the abstracting and extracting process. Also begun during the third and fourth years were experiments in the compilation of a "sentence dictionary" of syntactic types and compilation of English syntactic word government tables.
These activities were continued in the fifth year, along with development of a parsing program, the initiation of some extracting experiments on some technical text, and an experiment in automatic indexing of a medical book. In the sixth year, the "sentence dictionary" experiment was concluded, the extracting experiment was completed, a frequency-syntax method of indexing was conceived and tested, and the concept of English syntactic word government was expanded while compilation of the tables continued.

In the seventh year, compilation of the word government tables was temporarily halted while effort was concentrated in two main areas. First, the scope of the parsing program was extended, preparatory to eventual additional indexing experiments using syntax in conjunction with frequency and word government criteria. Second, a study in describing and abstracting pictorial structures was undertaken.

This year, the extensions to the parsing program were completed and tested, and a plan for a complete four-level parsing system was conceived and described, with the level of descriptive detail differing, of course, according to the extent of current implementation. Also during the year, compilation of English syntactic word government was resumed, in a somewhat augmented form. Finally, a series of experiments involving human subjects describing aerial photographs was completed and the results analyzed, particularly for the "metadescriptive" information in the descriptions and for derivation of canonical forms that can be used to represent the content of the descriptions.

Part I of this report is concerned with the development and uses of the syntactic analyzer and with the concept of English word government. Part II describes the investigations in describing and abstracting pictorial structures.
The group at Lockheed takes this opportunity to express their thanks for the continuing support and encouragement given by the Information Sciences Branch of the Office of Naval Research.

CONTENTS

PART I   EXPERIMENTS IN THE USE OF SYNTACTIC INFORMATION IN AUTOMATIC EXTRACTING AND INDEXING

Section
1   THE SYNTACTIC ANALYZER "PHRASE"
    1.1   Background: Previous Experiments in the Use of Syntactic Analysis in Automatic Indexing and Extracting
    1.2   Theory and Methodology
          1.2.1   Overview
          1.2.2   Summary of the Four Levels
          1.2.3   Methods of Ambiguity Resolution at Each Level
    1.3   Progress and Results
          1.3.1   Levels 1 and 2 of PHRASE
          1.3.2   Level 3 of PHRASE
2   ENGLISH WORD GOVERNMENT
    2.1   Nature of Word Government
    2.2   Utilization of Word Government in Syntactic Analysis
3   DOCUMENTATION
    3.1   BPHRAS - Level 1 of PHRASE Parser
    3.2   NESTPH - Level 2 of PHRASE Parser
    3.3   Output Program - For Level 1 and 2 of PHRASE
    3.4   CJPHAS Flow Diagram - A Preliminary or Working Draft for Level 3 of PHRASE
    3.5   Word Government Tables
4   REFERENCES

PART II   DESCRIBING AND ABSTRACTING PICTORIAL DATA

Section
1   INTRODUCTION
2   REPRESENTATION OF MEANING
    2.1   Conceptual Classes for Picture Description
          2.1.1   Case Relations
          2.1.2   The Remaining Conceptual Classes
    2.2   Picture Primitives
          2.2.1   Attribute Primitives
          2.2.2   Location Primitives
          2.2.3   Localizing Primitives
          2.2.4   Qualifying Primitives
          2.2.5   Classificational Primitives
          2.2.6   Operative Primitives
    2.3   Structuring the Data Base
          2.3.1   The Metanet
          2.3.2   Relationship Properties
3   NATURAL LANGUAGE ASPECTS OF CONCEPTUAL MAPPING
    3.1   Semantic Ambiguity
    3.2   Parsing Pictorial Descriptions
4   REFERENCES

Appendix
A   THE USE OF CASE STRUCTURES IN SEMANTIC MAPPING
B   DESCRIPTIVE REPRESENTATIONS OF REMOTELY SENSED IMAGE DATA

ILLUSTRATIONS

PART I

Figure
1    Sample of Limited Parsing
2    Four Levels of Analysis
3    Relationship of Basic Phrases to Binary Tree Representation
4    Examples of Participle Usages
5    Sample of Grammatical Structure After Level 3 Analysis
6    Resolution of Preposition-Conjunction Ambiguity
7    Example of Operation of Juxtaposition Rules
8    Example of All Possible Noun Phrases and Verb Phrases
9    Portion of Noun-Verb Ambiguity Logic
10   Example of Noun-Verb Ambiguity Resolution
11   Examples of Higher Level Phrases
12   Sample of Parsing Output
13   Example of Error Correction at Level 4 (Sentence 58)
14   Sample of Processed Text
15   BPHRAS Flow Diagram
16   NESTPH Flow Diagram - Conceptual
17   NESTPH Flow Diagram - Detail
18   OUTPUT Skeleton Flow Diagram
19   OUTPUT Flow Diagram - Detail

PART II

Figure
2-1   Example of a Conceptual Net for a Picture Description
3-1   Description Parsings Using BPHRAS

TABLES

PART I

Table
2-1   Full Government Table for the Verbal Uses of Hand

PART II

Table
1-1   Typical Descriptions for the Various Earth Resources Disciplines
2-1   Basic Concept Classes of a Description
2-2   Case Relations Used in Picture Description
2-3   Expressions for Location
2-4   Attributes of Objects
2-5   Properties of Relations Concerned With Locations
3-1   Entries for Letter A From Government Tables Pertinent to Picture Description
A-1   Cases Used by Fillmore
A-2   Cases Used by Schank
A-3   Cases Used by Tesler