II-l II. A Scatter Storage Scheme For Dictionary Lookups D. M. Murray 1# Introduction A document retrieval system must have some means of recording the subject matter of each document in its data base. Some systems store the actual text words, while others store keywords or similar content indicators. The SMART system [1J uses concept numbers for this purpose, each number Two advantages are indicating that a certain word appears in the document. apparent. element. were used. First, a concept number can be held in a fixed sized storage This produces faster processing than if variable sized keywords Second, the amount of storage required to hold a concept number Hence, storage space is used is less than that needed for most text words. more efficiently. SMART must be able to find the concept numbers for the words in any document or query. This is done by a dictionary lookup. There are two reasons why the lookup must be rapid. For text lookups, a slow scheme Is For handling costly because of the large number of words to be processed. user queries in an on-line system, a slow lookup adds to the user response time. Storage space is also an important consideration. Even for moderate sized subject areas the dictionary can become quite large — too large for computer main memory, or so large that the operation of the rest of the retrieval system is penalized. In most cases a certain amount of core storage is allotted to the dictionary, and the lookup scheme must do the II-2 best possible job within this allotment. This usually means keeping the overhead for the scheme as low as possible so that a large portion of the allotted core is available to hold dictionary words. The rest of the dic- tionary is placed in auxiliary storage and parts of it are brought in as needed. Obviously the number of accesses to auxiliary storage must be minimized. This paper represents a study of scatter storage schemes for application to dictionary lookup. servative with storage. schemes in general. These methods appear to be fast and yet con- The next two sections describe scatter storage The fourth section presents the results of various The fifth section discusses the The final sections deal with experiments with hash coding algorithms. design and use of a practical lookup scheme. extensions and conclusions. 2. Basic Scatter Storage A) Method A basic scatter storage scheme consists of a transformation algorithm and a table. The table serves as the dictionary and is constructed as follows. Given a natural language word, the algorithm operates on its bit pattern to produce an address, and the concept number for the word is placed in the table slot indicated by this address. be placed in the dictionary. and the table, a hash table. There are many possible algorithms for producing hash addresses. [2,3,4] Some of the most common are: This process is repeated for every word to The generated addresses are called hash addresses; 1) choosing bits from the square of the integer represented by II-3 the input word; 2) cutting the bit pattern into pieces and adding these pieces; 3) dividing the integer represented by the input word by the length of the hash table and using the remainder. BX Collisions The ideal situation would arise if every word placed in the dictionary had a unique hash address. However, as soon as a few slots in the hash table have been filled, the possibility of a collision arises — two or more words producing the same hash address. To differentiate among collided entries, the characters of the dictionary words must be stored along with their concept numbers. Then during lookup, the input word can be compared with the character string to verify that the correct table entry has been located. The problem of where to store the collided items has several methods of solution. [3] The linear scan method places a collided item in the The first free table slot after the slot indicated by the hash address. scan is circular over the end of the table. a crude algorithm to generate random offsets where slot H is the length of the hash table. is examined. The random probe method uses R(i) in the interval [1,H] A , If the colliding address is A+R(l) mod H The process is repeated until an empty slot is found. Both of these methods work best when the hash table is lightly loaded, that is when the ratio between the number of words entered and the number of table slots is small. In such cases the expected length of scan or average number of random probes is small. Chaining methods provide a satisfactory method of resolving collisions regardless of the load on the hash table. However, they require a II-4 second storage table — a bump table — for holding the collided items. When a collision occurs, both entries are linked together by a pointer and placed in the bump table. A pointer to this collision chain in placed in the hash Further colliding items are simply table along with an identifying flag. added to the end of the collision chain. C\ Table Layout and Search. Procedure In the virtual scatter storage system described later, the hash table has a high load factor. Hence the chained method (or rather a variation of Further discussion involves only scatter With this restriction, then, a it) is used to resolve collisions. storage systems using collision chains. scatter storage system consists of a hash table, a bump table, and the associated algorithm for producing hash addresses. A dictionary entry consists of a concept number and the character string for the word it represents. These entries are placed in the hash-bump table as described above. Conse- quently there are three types of slots in the hash table — slots that are empty, slots holding a single dictionary entry, and slots containing a pointer to a collision chain held in the bump table. is shown below. A typical table layout Hash Table empty slot Concept + Char. single dictionary entry Pointer > Entry 1 • > Entry 2 Collision Chain II-5 One of the advantages of scatter storage systems is that the search strategy is the same as the strategy for constructing the hash-bump tables. Given a word, its hash address is computed and the tables searched to find the proper slot. slot. During construction, dictionary information is placed in the The basic search procedure is illustrated by the flow diagram in The construction procedure is similar. Fig. 1, D\ Theoretical Expectations An ideal transformation algorithm produces a unique hash address for each dictionary word and thereby eliminates collisions. From a practical point of view, the best algorithms are those which spread their addresses uniformly over the table space. Producing a hash address is simply the process of generating a uniform random number from a given character string. If the addresses are truly random, a probability model may be used to predict various facts about the storage system. Suppose a hash table has entered in the hash-bump tables. table slots with i entries for Let H H. slots and that N words are to be be the expected number of hash . In other words H is i = 0,1,...N E the expected number of empty slots, entries, and E ,E ,...,E N is the expected number of single are the expected number of slots with various Even though the it€ims are physically located numbers of colliding items. in the bump table, they may be considered to "belong" to the same slot in the hash table. It is expected that: N CD E = I E i=0 N C2> N = I i E. ; 1=0 *• II-6 Input the Text Word Compute Hash Address x "Pointer" -Empty - > Get Next Bump Table Entry ± Dictionary JSl£ No -^ Word Never Entered in Dictionary Yes Yes T> Return Concept Number ± Flow Diagram for the Lookup Procedure in Basic Scatter Storage Systems Fig. 1 II-7 Now let X ll if exactly i items occur in the j slot =\ 0 if exactly i items do not occur in the j slot for j = 1 5 2,• . . , H H. = E L {X.. + X.n + . . . + X.uj } I xl x2 iE Then = I E {X 1 j=l 1] Assume that any chosen table slot is independent of the others so that the probability of getting any single item in the slot is the probability of getting exactly i items in that slot is 1/H . Then ^ \-\im u-ir Then l] i i 1 E { x . . } = 1- P. + 0 ( 1 - P . ) P. Substituting into the above O) H. = H-P. I I ^fl^rM)" For the cases of interest H and N tion can be used in equation C3): P. = e- N / H 1 1 fr i 01 o =, "" H , are large, and the Poisson approxima- CN/H)1 7-i II-B The ratio N/H a is the load factor mentioned previously. so that It is usually designated by (5) H. = H e" a a1 i = 0,1, . . . , N In a form more convenient for hand computation H o = e i = 1,2, . . . , K H. = H. ,-%I i-l I Equation C5) is sufficient to describe the state of the scatter storage system after the entry of N items. Most of the statistics of interest can be preducted using this expression; a few of them are listed in Table 1. The time required for a single lookup using a hash scheme depends on the number of probes into the table space, that is, how many slots must be examined. Suppose the word is actually found. If it is a single entry, only one probe is required. If the word is located in a collision chain, the number of probes is one (for the hash table) plus one additional probe for each element of the collision chain that must be examined. word is not in the dictionary. Suppose that the If its hash address corresponds to an empty However, if the address points table slot, again only one probe is needed. to a collision chain, the number is one plus the length of the chain. For words found in the dictionary the average number of probes per lookup is: (8) P = 1 + jj {(0)H j„ 1=2 i j=l + Cl+2)H2 + tl+2+3)H3 + . . , + Clt2+.„tNlH N i = 1 + I &* I J L II-9 Measure Load factor Number of empty table slots Number of single entries Number of collision chains of length i Expected sums H x = N e" a a = N/E Formula i _2 ' H. = H e" a -T^ X . - 9 0 • X ^,0,,. . ,N N E = y E. X .L 1=0 N N = y iH. 1=0 Fraction of hash table empty Fraction of table filled with single entries Fraction of hash table slots with i entries Expected sums F 0 F =k = E = e -a -a 0 = l E Hl ae F. = rr H. = e J L a. x N -a a i=2,3,. ..,N -TT 1 - .I F. L x 1 =0 N a = y L Number of collisions . x=o i F. H x 3 + Nc = H 2 + ... + HN 0 = H. - H Number of entries in the bump table Total table slots required Average lookup time Cprobes) B = N - H S = M t B P = 2 + « - e -a - H. 1 E = number of hash table slots N = number of words to be entered Expected Storage and Search Properties for Basic Scatter Storage Schemes Table 1 11-10 1 H n +£ I H H N i. = 2 Tfln - iCi+D X i+2i 1 N I i=2 N iCi + DF I t i 1 I N (i+l)F. . N A £_2 1 + lj2 N+l a 1 ) F i-l 2 + [ = 2 F i-l N+l 1+ 2L2 " - Ci 1)F i + 7 L2Fi^ 1 + - a + Cl-F ) 2 + —a - e - Cprobes) 3. Virtual Scatter Storage A) Method From Table 1, the expected number of collisions is N c = H - H o - tL 1 = HCl - e -— e ) For a fixed N , this number decreases as H increases. At the same tine the number of empty hash table slots H o = H e"N/H increases as H increases. Both of these results are expected — as the 11-11 hash addresses are spread over a larger and larger table space (H slots), the number of collisions should decrease and the number of empties increase for a fixed number of entries (N). A virtual scatter storage scheme tries to balance these opposing strains by combining hash coding with a sparse storage technique. Large or virtual hash addresses are used to obtain the collision properties associated with a very large hash table, and the storage technique is used to achieve the storage and search properties of a reasonably sized hash table. If the virtual hash address is taken large enough the expected number of collisions can be reduced to essentially zero. With no expected collisions, it is possible to dispense with verifying that a query word and the dictionary word are the same. It is enough to check that they produce the same virtual address. Hence, the character strings need not be stored in the hash-bump tables at all. To implement the virtual scheme a large hash address is computed, say in the range [0,V], and is split into a major and minor part. The major portion is used just as before — as an index on a hash table of size H. Instead of storing the character string in the hash or bump table, the With this difference, the virtual scheme The lookup procedure is identical, but minor portion is substituted. works just as the basic scheme. the minor portions are used for comparison rather than character strings. All the results of the previous section apply as storage and timing estimates. The advantage of virtual scatter storage systems is economy of storage space. The size of the minor portion is much smaller than the size It is true, that the virtual scheme of the character string it replaces. 11-12 assigns the same concept number to two different words if they have the same virtual address. This need not be disastrous for document retrieval applications. Presumably V is chosen large enough to keep the number of colli- sions small. On the one hand, errors could be neglected because of their low probability of occurrance and their small effect on the total performance of the retrieval system. On the other hand, it is always possible to resolve detected collisions even in a virtual scheme. Collisions may be detected during dictionary construction or updating, and the characters for the colliding words appended to the bump table. The hash or bump table entry must contain a pointer to these characters along with an identifying flag. Collisions occurring during actual lookups cannot be detected. B) Collision Problem In order to use a virtual hash- scheme, the virtual table must be large enough to reduce to expected number of collisions to an acceptable level. From a practical point of view, a collision may be considered to It Is assumed that the Let V be involve only 2 words, rather than 3, M , or more. - probability of these other types of collisions is negligible* the size of the virtual hash table. Then the expected number of collisions is simply N c = H 2 2 e -V- where a = — . In this case V>>N so that a is small and e is approxi- mately 1., 11-13 2 (9) N c = V 5L 2 N2 2V 13 N = 2 words. 1 Suppose, for example, the dictionary has size of the virtual hash table is chosen to be number of collisions is If the )f V = 2 , then the expected c (213)2 2C226X 1 2 Suppose further that this table size is adopted for the dictionary, and that the hash code algorithm produces 3 collisions. The question arises whether the algorithm is a good one — whether it produces uniform random addresses. The answer is found by extending the previous probability model. Consider a virtual scatter storage scheme in which the virtual table size is V , and N items are to be entered into the hash-bump tables. Let Again assume that collisions involve only two items. P(i) = Prob {i collisions} = Prob {i table slots have 2 items and N-2i slots have 1 item} The number of ways of choosing the ordered way) is: i pairs of colliding words Cin an f N [N-21 . . . N-2i+2| 2 1 01-211! There are i! ways of ordering these pairs and 11-14 v: (v)„ . = CV-N+i}! 'N-i ' ways of placing the pairs in the hash table, so that (10) P C i ) % i . , (N-2i)l 2-ij <-« " ' • - ^ r fori=a,l,...,L|j In a form for hand computation, en) PCO) = a - i-x a - fi . . . a - ^~i 2iCv-N+iy for x = 1 PCX) = P C I - i ) »2."-»LyJ These results are exact, but the following approximations can be used with accuracy N-l log PCO) = I log Cl - ^ N-l si 2V Let 3 N = — • Terms l i n e a r in . • N may be n e g l e c t e d in equation (11) g i v i n g PCO) = e~ 6 PCi) = I PCi-l> This i s a l s o a Pois-son d i s t r i b u t i o n : C12) PCi) = e-® C a.: ^r i=0,l,2,...,L|j 11-15 This equation gives the approximate probability of virtual scatter storage scheme. i collisions for a It may be used to form a confidence N = 6. ~y 5 interval around the expected number of collisions For the previous example in which the following table of values can be made: V = 2 , N = 2 ' ^ N i 0 1 2 3 PU> .607 .303 .076 .012 £PCU .607 .910 .986 .998 The probability is .986 that the number of collisions is less than or equal to 2. Since the algorithm gave 3 collisions, it appears to be a poor one. The results for the collision properties are summarized in Table 2. 4. Experiments with Algorithms for Generating Hash Addresses Any scatter storage scheme depends on a good algorithm for producing hash addresses. This especially is true for virtual schemes in which collisions are to be eliminated. In these experiments three basic algorithms The words in two dictionaries — The hash-bump tables are evaluated for use in virtual schemes. the ADI Wordform and CRAN 1400 Wordform — are used. are filled using these words and the resulting collision and storage statistics compared with the expected values. A) Dictionaries The ADI Wordform contains 7822 words pertaining to the field of docu- 11-16 Measure Formula Collision factor Expected number of collisions Probability of i collisions B-N2 6 1 i=0 ' IT * ! N = 6 c PCi) = e~e |i b Probability that the number of collisions C lies in [a,b] Prob = I P(i) i=a V N virtual hash table size number of words to be entered Expected Collision Properties for Virtual Scatter Storage Systems Table 2 11-1/ mentation. characters. It contains 206 common words (previously judged) averaging 3.93 The remaining 7616 noncommon words average 8.00 characters. In all there are 61,712 characters. The CRAN 1400 Wordform contains 8926 words dealing with aeronautics. The common word list consists of that of the ADI, plus four additional entries. The 8716 noncommon words average 8.40 characters. characters. Figs. 2 and 3 show the distribution of the length of the words versus percentage of collection. The abrupt end to the curves in Fig. 2 is due to There is a total of 74,074 truncation of words to 18 characters. Both dictionaries have approximately the same size and proportions of words of various length. different. However, their vocabularies are considerably A good hash scheme should work equally well on both dictionaries. B) Hash Coding Algorithms The By their nature, hash coding algorithms are machine dependent. computer representation of the alphabetic characters, the way in which arithmetic operations are done, and other factors all affect the randomness of the generated address. on the IBM S/360. Words are padded with some character to fill an integral number of S/360 fullwords. Then the fullwords are combined in some manner to form a The algorithms described below are intended for use single fullword key, and the final hash address is computed from this key. In the experiments which follow, the blank is used as a fill character. This is an unfortunate choice because of the binary representation of the blank 01000000. In some algorithms the zeroes may propagate or otherwise affect A good fill character Is one that the randomness. 1) 2) is not available on a keypunch or teletype, will not propagate zeroes, 11-18 O Common Words A ADI D CRAN 1400 8 10 12 14 Word Length Distribution of Dictionary Words According to Their Lengths Fig. 2 11-19 O Common Words A ADI D CRAN 1400 10 12 18 20 Word Length Cumulative Distribution of Dictionary Words According to Their Lengths Fig. 3 11-20 3) will generate a few carries during key formation, and M ) has the majority of its bits equal to 0, so their positions may be filled. A likely candidate for the S/360 is 01000101. Three basic methods of generating virtual hash addresses — addition, multiplication, and division — are studied. The first and second provide The second and third contrasting ways of forming the single fullword keys. differ in the way the hash address is computed from the key. Variations of each basic method are also tested to try to improve speed, programming ease, or collision-storage properties. 1. Addition Methods AC — addition and center The fullwords of characters are logically added to form the key. The key is squared and the centermost bits are selected as the major. major. AS — addition with shifting Same as AC, except the second, third, etc. full words are shifted two positions to the left before their addition in forming the key, properties) AM — addition with masking Same as AC, except the second, third, etc. fullwords have certain nonsignificant bits altered by masks before their addition in forming the key. storage properties) (An attempt to improve collision(An attempt to improve collision-storage The minor is obtained from bits on both sides of the 2. Multiplication Methods MC — multiply and center The fullwords of characters are multiplied together to form 11-21 the key. The center bits of the previous product are saved The key is squared The minor is as the multiplier for the next product. and the centermost bits selected as the major. obtained from the bits on both sides of the major. MSL — multiply and save left Same as MC, but during formation of the key, the high order bits of the products, rather than the center, are used as successive multipliers. CAn attempt to improve speed) MLM — multiply with left major Same as MC, but taking the major from the left half of the square of the key and the minor from the right half. attempt to improve speed) 3. Division Methods DP — divide by prime The fullwords of characters are multiplied together to form the key. The center bits of the previous product are saved The key is divided (An as the multiplier for the next product. by the length of the virtual hash table — a prime number in this case — and the remainder used as the virtual hash address. The major is drawn from the left end of the virtual address and the minor from the right. DO — divide by odd number Same as DP, except using a hash table whose length is odd. (An attempt to provide more flexibility of hash table sizes) DT — divide twice Same as DP, except two divisions are made. The major is produced by dividing the key by the actual hash table size. The minor results from a second division. throughout as divisors. collision properties) Primes are used CAn attempt to improve storage- C) Evaluation In the experiments to evaluate each variation of the above hash 11-22 schemes, the size of the virtual hash table varies from 2 to 2 slots. 12 m The actual hash table varies in size from 2 to 2 slots. Bump table space is used as needed. The tables are filled by the words from either the ADI or CRAN dictionaries and the collision and storage statistics taken. Because good collision properties are most important, they are examined first. The storage properties are dealt with later. The number of collisions obtained from each scheme versus the virtual table length is plotted in Figs. 4 to 7. The AD1 dictionary is shewn in Figs. 4 and 6, and the CRAN in Figs. 5 and 7. curves The circled lines correspond to generated from equations C9) and Cl2). The horizontal one shows the expected number of collisions and the lines above and below it enclose a 95% confidence interval about the expected curve. In other words, if an algorithm is generating random addresses, the probability is 95% that the curve for that scheme lies between the heavy lines. Consider Figs. 4 and 5 showing the results for all the addition methods and the MC variation of the multiplication variation. The AC and MC algorithms differ only in that addition is used in forming the key in the first one and multiplication in the second one. Yet the curves are spectacularly different. The result seems to have the following explanation* The purpose of a hash address computation is to generate a random number from a string of characters. If the bits in the characters are as varied as possible, then the algorithm has a headstart in the right direction. However, the S/360 bit patterns for the alphabet and numbers are: A to 1 J to R S to Z 0 to 9 1100 xxxx 1101 xxxx 1110 xxxx 1111 xxxx 11-23 ooooo Theoretical Curves (Equations (9) and (12)) Experimental Curves Interpolated Curves 49 45 41 37 29 25 18 16 16 12 8 ^C,AM33 AS 29 |-ooooooooooooc7 25 21 kjo1" 20 Virtual Hash Table Size (Power of two) °o, ooooooooooooo "OOOOOOOOOOOO HAC,AM r 4h AS 1C J oooooooooo — O^ICjSSS^ 000 8888888888 ° ^ C 26 28 MC<*>°°o°' -£$ 0| 22 24 Collisions in the ADI Dictionary for Addition and Multiplication Hash Schemes Fig. 4 ooooo Theoretical Curves (Equations (9) and (12)) Experimental Curves Interpolated Curves 58 54 50 46 29 18 14 10 17 13 9 1 6 12 M&40 °o 0 ooooooooooooo 38 34 oooooooooy oooooooooooo jaoooooooooooo Soooool 30 26 20 Virtual Hash Table Size (Power of two) Collisions in the CRAN Dictionary for Addition and Multiplication Hash Schemes Fig. 5 11-25 ooooo Theoretical Curves Experimental Curves Interpolated Curves 26 22 18 14 mo 16 12 DP 8 DO. 16 12 8 fDO 10 ^"""•P--^ 6h at -O DP DRi °<5io5boooooooooooo. DT^JMCa^2-"^ DT oooooooooooo2\-ooooQQQoqpoocmMCQ8SS88§§tiCt05\ MC.DP z T I 25h 26 28 20 Virtual Hash Table Size (Power of two) Collisions in the ADI Dictionary for Division and Multiplication Hash Schemes Fig. 6 ooooo Theoretical Curves (Equations (9) and (12)) Experimental Curves Interpolated Curves >oo 2l-oooooooooQfiPT^Toooooooooo^pjOPDT^MC Virtual Hash Table Size (Power of two) Collisions in the CRAN Dictionary and Multiplication Fig. 7 for Division Hash Schemes 11-27 In each case the two initial bits of a character are lTs so that in any given word -- of the bits are the same. r In forming a key, the successive additions in the AC algorithm may obscure these nonrandom bits if sufficient number number of carries are generated. However, the number of additions performed is usually small — 2 or 3 — and it appears that the patterns are not broken sufficiently. The MC algorithm uses multiplication to form its keys which involves some 31 additions — certainly enough to make the resulting key random. fThe multiplications ?. the MC algorithm are costly in terms of com*n Therefore the AS and AM algorithms are tried. These addi- putation time. tion variants try to hasten the breakup of the nonrandom bits by shifting and masking respectively. Although these variants reduce the number of collisions somewhat, none of the addition schemes could be called random. Typically a few words are singled out at some point and continue to collide regardless of the length of the virtual address. are listed below. Several collision pairs Note the similarities between the words. COUNT WORTH TOLERATED WHEEL SOUND FORTY TELEMETER SHEET During key formation, the Consider the multiplication algorithms. process of saving the center of successive products adds to the computation time. The MSL variation attempts to remedy this by saving only the high order bits between multiplications Con the S/360 this means saving the upper 32 bits of the 64 bit product). This method is so inferior that its collision graph could not be included with the others. The poor results 11-28 stem from the fact that characters at the end of fullwords have little effect on the key and that the later multiplications swamped the effects of the earlier ones. Examples of collision pairs are given below. For convenience the fullwords are separated by blanks. CERT AINT Y PREV EKTE D HEAV ING EXPE NSE CHAR TER I CERT AINL Y PRES ENTE D HEAT ING EXPA NSE CHAP TER - The MC and MLM variants are identical with respect to collision properties. In general these algorithms produce good results, reducing the num- ber of collisions to zero in both dictionaries. The collision curve is always beneath the expected one. Consider Fig. 7 and 8 showing the results for all division methods and the MC method. All of the division algorithms display a distinct rise 24 m the number of collisions when the virtual table size is near 2 — regard- less of the dictionary. The majority of the colliding word pairs are 4 char- acter words having the same two middle letters. This brings to light a curious fact about division algorithms. For virtual tables, the divisor of the key is large and the initial few bits determine the quotient, leaving the rest for the remainder. For words of less than 4 characters (which require 24 no multiplications during key formation)> dividing by 2 is equivalent to selecting the last 3 characters of the word as the hash address. Because . . 24 the divisors are not exactly equal to 2 , only the two middle characters tend to be the same. Examples are: DEAL BEAR 11-29 TOOK HELD VERB - SOON CELL TERM This phenomenon apparently continues for table sizes around 2 and 28 2 , but there are few or no words of 4 characters or less which agree in 26 or 28 bits. 24 Eor divisors smaller than 2 , a larger part of the key determines the quotient and apparently breaks up the pattern. Because the above effect occurs only for passed c^ver on the graphs. In general, the DT algorithm is superior to the rest of the division methods, mostly because each of its two divisors is smaller than those used in other methods. divisors. On the basis of collision properties, the MC, MLM, DT, and possibly AS algorithms are the best. these methods only. The experiments with each hash coding method also include counting the frequency of various length of collision chains. Here a collision The frequency Storage-search evaluations are included for Prime numbers seem to produce better results than other V = 2 24 , these points are chain refers to chains of words producing the same major. counts are compared with the expected counts given by equation (5). The comparison is in terms of a chi-square goodness of fit test with a 10% level of significance. dictionary. Figs. 8 and 9 show the results of this test for each Included in the graphs is the line corresponding to the 10% If the major portions of the hash addresses are level of significance. really random, there is a probability of 0.90 that the 10% line will lie above the curve for the algorithm tested. AS.DT MLM MC-MC-MC -MC- _L 20 22 JL 24 26 28 Virtual Hash Table Size (Power of two) X —curve for 10% level of significance Deviations of Values for Storage-Search Properties Storage Hash Dictionary from Expected ADI Schemes using the Fig. 8 MLM MLM _L 20 22 J_ 24 J_ 26 28 Virtual Hash Table Size (Power of two) X —curve for 10% level of significance Deviations of Storage —Search Properties from Expected Values for Selected Hash Schemes using the CRAN Dictionary Fig. 9 11-32 Consider the MC and MLM algorithms which differ only in that the major is selected from the center and left of the virtual address. From the graphs, it is clear that the multiplication methods produce their mos": random bits in the center of their product. This is somewhat as expected because the center bits are involved in more additions than other bits. The division algorithm, which had fairly good collision properties, seems to have rather mediocre storage properties. This is probably due to the same causes as the collision problems, but working at a lower level, and not affecting the results as much. The AS curve is included simply for completeness. The scheme displays a well behaved storage curve, but unfortunately it has poor collision properties. In summary, the MC scheme seems to be the best for both dictionaries in terms of collision and search properties. In terms of computing time, the method is more time consuming than the addition methods, but less expensive than the division methods. The differences in computation times is not an extremely big factor. All methods required from 35 to 55 microseconds The routines are coded in assembly for an 8 character word on the S/360/65. language and called from a Fortran executive. The times above include the necessary bookkeeping for linkage between the routines. 5. A Practical Lookup Scheme A) General Description The lookup scheme described below is designed for use with dictionaries 15 29 of about 2 v words. The virtual table size selected is 2 ' and the actual 15 table size is 2 . On the basis of the results presented in previous sections, 11-33 when the dictionary is full, it is expected that 1) 2) 36.8% of the hash table will be empty, 36.8% of the hash table will be single entries, 15 3) the bump table will require (0.632)2 v e n t r i e s , *+) 1 collision is e x p e c t e d , 5) the probability of 5 or fewer collisions is 0.999, and 6) the average lookup will require 2.13 probes. (B) Table Layout In all previous discussions a dictionary entry has included a minor and a concept number. A concept number is simply a unique number assigned to each word. used. ber. A dictionary entry contains a 14 bit minor and a single bit indicating whether the word is common or noncommon; The hash address of a word is also unique, and hence can be There is no need to store and use a previously assigned concept num- 1 2 Minor 15 C = 0 C = 1 implies the word is common implies the word is noncommon. A hash table entry contains 16 bits arranged as: 0 Flag 1 Information lb Flag = 0 implies that the information is a dictionary entry 11-34 Flag = 1 implies that the information is a pointer to the bump table Words that have the same major are stored in a block of consecutive locations in the bump table. This eliminates the need for pointers in the collision "chains". A bump table entry also has 16 bits structured as: 0 1 1 C 2 Minor 1 . • . • • • • , . . • „ | • . 15 „ • • • End End = 0 implies that the entry is not the last in the collision block End = 1 implies that the entry is the last in the block. Some convention must be adopted to signify an empty hash table slot. A zero is most convenient in the above scheme. Unfortunately a zero is also a legitimate minor. However, to cause trouble the word generating the zero minor would have to be a common word and a single table entry (zero minors in the bump table are no problem). Hopefully this occurs rarely because of the size of the minor (14 bits) and the small number of common words. However, even if this combination of circumstances occurs, the common word could be placed in the bump table anyway. In designing the tables, it is important to make the hash table entries large enough to accommodate the largest pointer anticipated for the bump table. 2 For the above scheme, the expected bump table size is less than so that the 15 bits allocated for pointers is sufficient. C) Search Considerations The number of probes needed to locate any given word depends on the 11-35 place that the word occupies in a collision block. The average search time is improved if the most common words occupy the initial slots in each block, A study of ADI text yields the following statistics. Division of Words by Category Number of Words 17270 Total words 8716 Common words 8554 Noncommon words Percent of Total 100.0 50.5 49.5 Distribution of Lengths Number of Characters All Words Percent Common Words Percent Noncommon Words 2097 4003 2217 221 11 5 8554 8.3 Percent 1-4 5-8 9-12 13-16 17-20 21-24 Totals Average Length 10145 4630 2249 221 11 5 17270 6.3 58.8 26.8 13,0 1.3 0.1 8057 627 32 0 92.5 7.2 0.3 0.0 0.0 0.0 100.0- 24.5 46.8 25.9 2.6 0. .1 0.1 100.0 a 0 8716 4.3 a. a 100,0 Using the categorical information, it appears that in filling the hash-bump tables, the common words should be entered first. Within each category, all words should be entered in frequency order if such information is known, If frequency information is not available, the distribution by lengths can 11-36 be used as an approximation to it. For common words, this means entering the shorter words first. For noncommon words, the words of 5 to 8 charac- ters should be entered first. The greater the number of single entries, the greater the average search speed. Fig. 1Q shows the fraction of single entries (J,) and frac- tion of empty slots (J ) for various load factors. The fraction of single entries F —t o = ae reaches a maximum for a = 1, but since the slope of the curve is small around this point, and load factor in the interval CD.8, 1.2) is practically the same. Table usage is better, however, for the larger values of a. These facts imply that scatter storage schemes make most a = 1. Thus the efficient use of space and time for Most text words can be assumed to be in the dictionary. order of comparisons during lookup should be: Hash Table Scan 1) check minor assuming the text word is a common word 2) check minor assuming the word is noncommon 3) check if the entry is a pointer to the bump table 4) check if the entry is empty First Bump Table Entry Cmust be at least two) 5) check minor assuming the word is a common word 6) check minor assuming the word is noncommon Other Bump Table Entries 7) check minor assuming the word is noncommon 8) check minor assuming the word is common 9) check if at end of collision block. 11-37 .70 60 O A Fraction Empty Slots Fraction of Single Entries P o X 50 .40 A" c o a o .30 / / .20 ioh .8 1.2 Load Factor 1.6 2.0 Theoretical Hash Table Usage F i g . 10 11-38 The search pattern can be varied to take advantage of the storage conditions. For example, if all common words are either single entries or the first element of a collision block, then step 8) may be eliminated. D) Performance The lookup system described above has been implemented and tested on the IBM S/360/65. A modified form of the MC algorithm is used to com- pute a 29 bit virtual address and divide it into a 15 bit major and a 1 4 bit minor. The modification is the inclusion of a single left shift of the fullwordi of characters during key formation. This breaks up certain types of symmetries between words such as WINGTAIL and TAILWING. Without this, such words will always collide. The hash-bump tables were filled with entries from the ADI dictionary — common words first, followed by noncommon words. The shortest words were entered first. A comparison of the expected and actual results follows. a = .239 Number of empty table slots Number of single entries Number of collision blocks Longest collision block Average length of collision blocks Size of bump table Number of collisions Average probes per lookup Expected 25810 6161 797 4 2.1 16-63 .06 1.33 Actual 25762 6250 756 4 2,1 1572 1.33 11-39 To obtain the actual lookup times 627 words were processed. words were read from cards and all punctuation removed. The Each word was passed to the lookup program as a continuous string of characters with the proper number of fill characters added. below (in microseconds): The resulting times are given Category of Words All v Common Noncommon Not found* Number of Words 627 288 338 1 Percent of Total 100.0 45.9 53.9 0.2 Average Time 57.9 49.9 64.7 53.1 Standard Deviation 11.7 6.7 10.7 0.0 Average Probes 1.18 1.12 1.24 1.00 * A larger sample with less accurate timings indicates that the average time for words in this category is about 62 microseconds (standard deviation 26). The time to compute a hash address depends on the length of the word. acters. Let n be the number of S/360 fullwords needed to hold these char- The time to form the initial address is I(n) = 34.5 + 10.2 (n - 1) microseconds. The average total lookup time, then, is T = I(n) + cP where c is the average time per probe into the table space and For the words in the experiment P is the average number of probes. I(n) = 40.3, and onds. T = 57.9 n = 2.3 2 (average), so that each probe required about 15 microsec- 11-40 E) Comparisons Timing information for other lookup schemes is difficult to obtain. A tree structured dictionary is used for a similar purpose at Harvard. lished information indicates 6pq microseconds are needed to process in a dictionary of q entries. This time is for the IBM 709*4. p Pubwords Translating this time to the S/360/65, which is roughly 4 times faster, and using the ADI dictionary (q = 7822), it appears that each lookup averages 11,000 microseconds. unknown. Exactly how much computation and input-output this includes is 6. Extensions A) Larger Dictionaries As more words are added to the dictionary, the size of the virtual address must increase in order to prevent collisions. As a result, the num- ber of bits per table slot must also increase in order to accommodate the larger minors and pointers that are used. For a fixed sized hash table, the At some number of entries in the bump table grows as new words are added. point the space required for tables will exceed the amount of core allotted for dictionary use. To salvage the scheme, it may be possible to split the bump table into parts — one part for more frequently used words and one for words in rather rare usage. During dictionary construction common words When a rare word must be are entered first, then noncommon, then rare. placed in a collision block, a marker is stored instead, and the item is placed in the secondary bump table. Presumably the nature of the words in the second bump table will make its usage rather infrequent, thus saving accesses to auxiliary storage to fetch it. 11-41 B) Suffix Removal Many dictionary schemes store only word stems; the lookup attempts to match only the stem, disregarding suffixes in the process. easily done with scatter storage schemes. This is not One solution is to try to remove Each of the various possiAnother the suffix after an initial search has failed. ble stems must be looked up independently until a match is found. solution is to use a table of correspondences between the various forms of a word and its stem. The concept number could be used as an index on this A thesaurus table containing pointers to information about the actual stem. lookup can be handled the same way. 7. Conclusions Virtual scatter storage schemes are well suited for dictionaries, having both rapid lookup and economy of storage. The rapid lookup is due to the fact that the initial table probe limits the search to only a few items. The space savings come from the fact that the actual character The schemes depend The strings for words are not part of the dictionary. heavily on a good algorithm for producing random hash addresses. theory developed in Sections 2 and 3 gives a basis for judging the worth of proposed algorithms. For any particular application, the table organization may vary to suit different needs and to store different information. advantages of scatter storage schemes are still present. However, the References [1] Salton, G.f A document retrieval system for man-machine interaction, Association for Computing Machinery, Proceedings of the 19th National Conference, Philadelphia, Pennsylvania, August 25-27, 1964, pp. L2.3-1 — L2.3-20. Mcllroy, M. D., Dynamic Storage Allocation, unpublished manuscript, Bell Telephone Laboratories, Inc., 1965. Morris, R., "Scatter Storage Techniques", Communications of the ACM, January, 1968. Maurer, W. D., "An Improved Hash Code for Scatter Storage", Communications of the ACM, January, 1968. Johnson, L. R., "Indirect Chaining Method for Addressing on Secondary Keys", Communications of the ACM, May 1961. [2] [3] [4] {5J