|
|
1.1 root 1: .\" @(#)refer 6.1 (Berkeley) 5/22/86
2: .\"
3: .... refer | tbl | nroff -ms
4: .EH 'USD:30-%''Some Applications of Inverted Indexes on the UNIX System'
5: .OH 'Some Applications of Inverted Indexes on the UNIX System''USD:30-%'
6: .nr LL 6.5i
7: .nr LT 6.5i
8: .de UC
9: \\s-2\\$1\\s0\\$2
10: ..
11: .ds . \&\s+2.\s0
12: .if t .ds -- \(em
13: .if n .ds -- --
14: .TR 69
15: \".TM 77-1274-17 39199 39199-11
16: .ND October 27, 1977
17: .ND June 21, 1978
18: .TL
19: Some Applications of Inverted Indexes on the UNIX System
20: .AU "MH 2C-572" 6377
21: M. E. Lesk
22: .AI
23: .MH
24: .\".AB
25: .\".LP
26: .\".ft B
27: .\"I. Some Applications of Inverted Indexes \- Overview
28: .\".ft R
29: .\".PP
30: .\"This memorandum describes a set of programs which
31: .\"make inverted indexes to
32: .\"UNIX*
33: .\"text files, and their
34: .\"application to
35: .\"retrieving and formatting citations for documents prepared using
36: .\".I troff.
37: .\".PP
38: .\"The indexing and searching programs make keyword
39: .\"indexes to volumes of material too large for linear searching.
40: .\"Searches for combinations of single words can be performed quickly.
41: .\"The programs for general searching are divided into
42: .\"two phases. The first makes an index from the original
43: .\"data; the second searches the index and retrieves
44: .\"items.
45: .\"Both of these phases are further divided into two parts
46: .\"to separate the data-dependent and algorithm dependent
47: .\"code.
48: .\".PP
49: .\"The major current application of these programs is
50: .\"the
51: .\".I troff
52: .\"preprocessor
53: .\".I refer.
54: .\"A list of 4300 references is maintained on line,
55: .\"containing primarily papers written and cited by
56: .\"local authors.
57: .\"Whenever one of these references is required
58: .\"in a paper, a few words from the title or author list
59: .\"will retrieve it, and the user need not bother to re-enter
60: .\"the exact citation.
61: .\"Alternatively, authors can use their own lists of papers.
62: .\".PP
63: .\"This memorandum is of interest to
64: .\"those who are interested in facilities for searching large
65: .\"but relatively unchanging text files on
66: .\"the
67: .\"UNIX
68: .\"system,
69: .\"and those who are interested in handling bibliographic
70: .\"citations with
71: .\"UNIX
72: .\".I troff.
73: .\".LP
74: .\".ft B
75: .\"II. Updating Publication Lists
76: .\".PP
77: .\"This section is a brief note describing the
78: .\"auxiliary programs for managing the updating
79: .\"processing.
80: .\"It is written to aid clerical users in
81: .\"maintaining lists of references.
82: .\"Primarily, the programs described permit a large
83: .\"amount of individual control over the content
84: .\"of publication lists while retaining the
85: .\"usefulness of the files to other users.
86: .\".LP
87: .\".ft B
88: .\"III. Manual Pages
89: .\".PP
90: .\"This section contains the pages from the
91: .\"UNIX programmer's manual
92: .\"dealing with these commands.
93: .\"It is useful for reference.
94: .\".sp
95: .\"\l'3i'
96: .\".br
97: .\"* UNIX is a trademark of Bell Laboratories.
98: .\".AE
99: .CS 10 4 14 0 0 4
100: .NH
101: Introduction.
102: .PP
103: The
104: .UX
105: system
106: has many utilities
107: (e.g. \fIgrep, awk, lex, egrep, fgrep, ...\fR)
108: to search through files of text,
109: but most of them are based on a linear scan through the
110: entire file, using some deterministic automaton.
111: .ev 1
112: .ps 8
113: .vs 10p
114: .ev
115: This memorandum discusses a program which uses inverted
116: indexes
117: .[
118: %A D. Knuth
119: %T The Art of Computer Programming: Vol. 3, Sorting and Searching
120: %I Addison-Wesley
121: %C Reading, Mass.
122: %D 1977
123: %O See section 6.5.
124: .]
125: and can thus be used on much larger data bases.
126: .PP
127: As with any indexing system, of course, there are some disadvantages;
128: once an index is made, the files that have been indexed can not be changed
129: without remaking the index.
130: Thus applications are restricted
131: to those making many searches
132: of relatively stable data.
133: Furthermore, these programs depend on hashing, and can only
134: search for exact matches of whole keywords.
135: It is not possible to look for
136: arithmetic or logical expressions (e.g. ``date greater than 1970'') or
137: for regular expression searching such as that in
138: .I lex.
139: .[
140: lex lesk cstr
141: .]
142: .PP
143: Currently there are two uses of this software,
144: the
145: .I refer
146: preprocessor to format references,
147: and the
148: .I lookall
149: command to search through all text files on
150: the
151: .UX
152: system.\(dd
153: .FS
154: \(dd \fIlookall\fP is not part of the Berkeley UNIX distribution.
155: .FE
156: .PP
157: The remaining sections of this memorandum discuss
158: the searching programs and their uses.
159: Section 2 explains the operation of the searching algorithm and describes
160: the data collected for use with the
161: .I lookall
162: command.
163: The more important application,
164: .I refer
165: has a user's description in section 3.
166: Section 4 goes into more detail on
167: reference files
168: for the benefit of those who
169: wish to add references to data bases or
170: write new
171: .I troff
172: macros for use with
173: .I refer.
174: The options to make
175: .I refer
176: collect identical citations, or otherwise relocate and adjust references,
177: are described in section 5.
178: .NH
179: Searching.
180: .PP
181: The indexing and searching process is divided into two phases,
182: each made of two parts.
183: These are
184: shown below.
185: .IP A.
186: Construct the index.
187: .RS
188: .IP (1)
189: Find keys \*(-- turn the input files into a sequence of tags and keys,
190: where each tag identifies a distinct item in the input
191: and the keys for each such item are the strings under which it is
192: to be indexed.
193: .IP (2)
194: Hash and sort \*(--
195: prepare a set of inverted indexes from which, given a set of keys,
196: the appropriate item tags can be found quickly.
197: .RE
198: .IP B.
199: Retrieve an item in response to a query.
200: .RS
201: .IP (3)
202: Search \*(--
203: Given some keys, look through the files prepared by the hashing
204: and sorting facility and derive the appropriate tags.
205: .IP (4)
206: Deliver \*(--
207: Given the tags, find the original items. This completes the
208: searching process.
209: .RE
210: .LP
211: The first phase, making the index, is presumably done relatively infrequently.
212: It should, of course, be done whenever the data being
213: indexed change.
214: In contrast, the second phase, retrieving items,
215: is presumably done often, and must be rapid.
216: .PP
217: An effort is made to separate code which depends on the data
218: being handled from code which depends on the searching procedure.
219: The search algorithm is involved only in programs
220: (2) and (3), while knowledge of the actual data files is
221: needed only by programs (1) and (4).
222: Thus it is easy to adapt to different data files or different
223: search algorithms.
224: .PP
225: To start with, it is necessary to have some way of selecting
226: or generating keys from input files.
227: For dealing with files that are basically English, we have
228: a key-making program which automatically selects words
229: and passes them to the hashing and sorting program (step 2).
230: The format used has one line for each input item,
231: arranged
232: as follows:
233: .DS
234: name:start,length (tab) key1 key2 key3 ...
235: .DE
236: where
237: .I name
238: is the file name,
239: .I start
240: is the starting byte number,
241: and
242: .I length
243: is the number of bytes in the entry.
244: .PP
245: These lines are the only input used to make the
246: index.
247: The first field (the file name, byte position, and byte count)
248: is the tag of the item
249: and can be used to retrieve it quickly.
250: Normally, an item is either a whole file or a section of a file
251: delimited by blank lines.
252: After the tab, the second field contains the keys.
253: The keys, if selected by the automatic program, are
254: any alphanumeric strings which
255: are not among the 100 most frequent words in English
256: and which are not entirely numeric (except for four-digit
257: numbers beginning 19, which are accepted as dates).
258: Keys are truncated to six characters and converted to lower case.
259: Some selection is needed if the original items are very large.
260: We normally just take the first
261: .I n
262: keys, with
263: .I n
264: less than 100 or so; this replaces any attempt at intelligent selection.
265: One file in our system is
266: a complete English dictionary; it would presumably be retrieved for all queries.
267: .PP
268: To generate an inverted index to the list of record tags and keys,
269: the keys
270: are hashed
271: and sorted to produce an index.
272: What is wanted, ideally, is a series of lists showing the tags associated
273: with each key.
274: To condense this,
275: what is actually produced is a list showing the tags associated
276: with each hash code, and thus with some set of keys.
277: To speed up access and further save space,
278: a set of three or possibly four files is produced.
279: These files are:
280: .KS
281: .bd 2 2
282: .TS
283: center;
284: c c
285: lI l.
286: File Contents
287: entry Pointers to posting file
288: for each hash code
289: posting Lists of tag pointers for
290: each hash code
291: tag Tags for each item
292: key Keys for each item
293: (optional)
294: .TE
295: .bd 2
296: .KE
297: The posting file comprises the real data: it contains a sequence of lists
298: of items posted under each hash code. To speed up searching,
299: the entry file is an array of pointers into the posting file, one per potential
300: hash code.
301: Furthermore, the items in the lists in the posting file are not referred to by their
302: complete tag, but just by an address in the tag file, which
303: gives the complete tags.
304: The key file is optional and contains a copy of the keys
305: used in the indexing.
306: .PP
307: The searching process starts with a query, containing several keys.
308: The goal is to obtain all items which were indexed under these keys.
309: The query keys are hashed, and the pointers in the entry file used
310: to access the lists in the posting file. These lists
311: are addresses in the tag file of documents posted under the
312: hash codes derived from the query.
313: The common items from all lists are determined;
314: this must include the items indexed by every key, but may also
315: contain some items which are false drops, since items referenced by
316: the correct hash codes need not actually have contained the correct keys.
317: Normally, if there are several keys in the query, there are not
318: likely to be many false drops in the final combined list even though
319: each hash code is somewhat ambiguous.
320: The actual tags are then obtained from the tag file, and to guard against
321: the possibility that an item has false-dropped on some hash code
322: in the query, the original items are normally obtained from the delivery
323: program (4) and the query keys checked against them
324: by string comparison.
325: .PP
326: Usually, therefore, the check for bad drops is made against the original file.
327: However, if the key derivation procedure is complex, it may be preferable
328: to check against the keys fed to program (2).
329: In this case the optional key file which contains the
330: keys associated with each item is generated, and the item tag is supplemented
331: by a string
332: .DS
333: ;start,length
334: .DE
335: which indicates the starting byte number in the key file and the length of
336: the string of keys for each item.
337: This file is not usually necessary with the present
338: key-selection program, since the keys
339: always appear in the original document.
340: .PP
341: There is also an option
342: (\f3-C\f2n\|\f1)
343: for coordination level searching.
344: This retrieves items which match all but
345: .I n
346: of the query keys.
347: The items are retrieved in the order of the number
348: of keys that they match.
349: Of course,
350: .I n
351: must be less than the number of query keys (nothing is
352: retrieved unless it matches at least one key).
353: .PP
354: As an example, consider one set of 4377 references, comprising
355: 660,000 bytes.
356: This included 51,000 keys, of which 5,900 were distinct
357: keys.
358: The hash table is kept full to save space (at the expense of time);
359: 995 of 997 possible hash codes were used.
360: The total set of index files (no key file) included 171,000 bytes,
361: about 26% of the original file size.
362: It took 8 minutes of processor time to
363: hash, sort, and write the index.
364: To search for a single query with the resulting index took 1.9 seconds
365: of processor time,
366: while to find the same paper
367: with a sequential linear search
368: using
369: .I grep
370: (reading all of the tags and keys)
371: took 12.3 seconds of processor time.
372: .PP
373: We have also used this software to index all of the English stored on our
374: .UX
375: system.
376: This is the index searched by the
377: .I lookall
378: command.
379: On a typical day there were
380: 29,000 files in our user file system, containing about 152,000,000
381: bytes.
382: Of these
383: 5,300 files, containing 32,000,000 bytes (about 21%)
384: were English text.
385: The total number of `words' (determined mechanically)
386: was 5,100,000.
387: Of these 227,000 were selected as keys;
388: 19,000 were distinct, hashing to 4,900 (of 5,000 possible) different hash codes.
389: The
390: resulting inverted file indexes used 845,000 bytes, or about
391: 2.6% of the size of the original files.
392: The particularly small indexes are caused by the
393: fact that keys are taken from only the first 50 non-common words of
394: some very long input files.
395: .PP
396: Even this large \f2lookall\f1 index can be searched quickly.
397: For example, to find this document
398: by looking for the keys
399: ``lesk inverted indexes''
400: required
401: 1.7 seconds of processor time
402: and system time.
403: By comparison, just to search the 800,000 byte dictionary (smaller than even
404: the inverted indexes, let alone the 27,000,000 bytes of text files) with
405: .I grep
406: takes 29 seconds of processor time.
407: The
408: .I lookall
409: program is thus useful when looking for a document which you believe
410: is stored on-line, but do not know where. For example, many memos
411: from our center are in the file system, but it is often
412: difficult to guess where a particular memo might be (it might have several
413: authors, each with many directories, and have been worked on by
414: a secretary with yet more directories).
415: Instructions for the use of the
416: .I lookall
417: command are given in the manual section, shown
418: in the appendix to this memorandum.
419: .PP
420: The only indexes maintained routinely are those of publication lists and
421: all English files.
422: To make other indexes, the programs for making keys, sorting them,
423: searching the indexes, and delivering answers must be used.
424: Since they are usually invoked as parts of higher-level commands,
425: they are not in the default command
426: directory, but are available to any user in the directory
427: .I /usr/lib/refer .
428: Three programs are of interest:
429: .I mkey ,
430: which isolates keys from input files;
431: .I inv ,
432: which makes an index from a set of keys;
433: and
434: .I hunt ,
435: which searches the index and delivers the items.
436: Note that the two parts of the retrieval phase are combined into
437: one program, to avoid the excessive system work and delay which
438: would result from running these as separate processes.
439: .PP
440: These three commands have a large number of options to adapt to different
441: kinds of input.
442: The user not interested in the detailed description that now follows may
443: skip to section 3, which describes the
444: .I refer
445: program, a packaged-up version of these tools specifically
446: oriented towards formatting references.
447: .PP
448: .B
449: Make Keys.
450: .R
451: The program
452: .I mkey
453: is the key-making program corresponding to step (1) in phase A.
454: Normally, it reads its input from the file names given as arguments,
455: and if there are no arguments it reads from the standard input.
456: It assumes that blank lines in the input delimit
457: separate items, for each of which a different line of
458: keys should be generated.
459: The lines of keys are written on the standard output.
460: Keys are any alphanumeric string in the input not
461: among the most frequent words in English and not entirely numeric
462: (except that all-numeric strings are acceptable if they
463: are between 1900 and 1999).
464: In the output, keys are translated to lower case, and truncated
465: to six characters in length; any associated punctuation is removed.
466: The following flag arguments are recognized by
467: .I mkey:
468: .TS
469: center;
470: lB lw(4i).
471: \-c \f2name T{
472: Name of file of common words;
473: default is
474: .I /usr/lib/eign.
475: T}
476: \-f \f2name T{
477: Read a list of files from
478: .I name
479: and take each as an input argument.
480: T}
481: \-i \f2chars T{
482: Ignore all lines which begin with `%' followed by any character
483: in
484: .I chars .
485: T}
486: \-k\f2n T{
487: Use at most
488: .I n
489: keys per input item.
490: T}
491: \-l\f2n T{
492: Ignore items shorter than
493: .I n
494: letters long.
495: T}
496: \-n\f2m T{
497: Ignore as a key any word in the first
498: .I m
499: words of the list of common English words.
500: The default is 100.
501: T}
502: \-s T{
503: Remove the labels
504: .I (file:start,length)
505: from the output; just give the keys.
506: Used when searching rather than indexing.
507: T}
508: \-w T{
509: Each whole file is a separate item;
510: blank lines in files are irrelevant.
511: T}
512: .TE
513: .PP
514: The normal arguments for indexing references are
515: the defaults, which are
516: .I "\-c /usr/lib/eign" ,
517: .I \-n100 ,
518: and
519: .I \-l3 .
520: For searching, the
521: .I \-s
522: option is also needed.
523: When the big
524: .I lookall
525: index of all English files is run,
526: the options are
527: .I \-w ,
528: .I \-k50 ,
529: and
530: .I "\-f (filelist)" .
531: When running on textual input,
532: the
533: .I mkey
534: program processes about 1000 English words per processor second.
535: Unless the
536: .I \-k
537: option is used (and the input files are long enough for
538: it to take effect)
539: the output of
540: .I mkey
541: is comparable in size to its input.
542: .PP
543: .B
544: Hash and invert.
545: .R
546: The
547: .I inv
548: program computes the hash codes and writes
549: the inverted files.
550: It reads the output of
551: .I mkey
552: and writes the set of files described earlier
553: in this section.
554: It expects one argument, which is used as the base name for
555: the three (or four) files to be written.
556: Assuming an argument of
557: .I Index
558: (the default)
559: the entry file is named
560: .I Index.ia ,
561: the posting file
562: .I Index.ib ,
563: the tag file
564: .I Index.ic ,
565: and the key file (if present)
566: .I Index.id .
567: The
568: .I inv
569: program recognizes the following options:
570: .TS
571: center;
572: lB lw(4i).
573: \-a T{
574: Append the new keys to a previous set of inverted files,
575: making new files if there is no old set using the same base name.
576: T}
577: \-d T{
578: Write the optional key file.
579: This is needed when you can not check for false drops by looking
580: for the keys in the original inputs, i.e. when the key derivation
581: procedure is complicated and
582: the output keys are not words from the input files.
583: T}
584: \-h\f2n T{
585: The hash table size is
586: .I n
587: (default 997);
588: .I n
589: should be prime.
590: Making \f2n\f1 bigger saves search time and spends disk space.
591: T}
592: \-i[u] \f2name T{
593: Take input from file
594: .I name ,
595: instead of the standard input;
596: if
597: .B u
598: is present
599: .I name
600: is unlinked when the sort is started.
601: Using this option permits the sort scratch space
602: to overlap the disk space used for input keys.
603: T}
604: \-n T{
605: Make a completely new set of inverted files, ignoring
606: previous files.
607: T}
608: \-p T{
609: Pipe into the sort program, rather than writing a temporary
610: input file.
611: This saves disk space and spends processor time.
612: T}
613: \-v T{
614: Verbose mode; print a summary of the number of keys which
615: finished indexing.
616: T}
617: .TE
618: .PP
619: About half the time used in
620: .I inv
621: is in the contained sort.
622: Assuming the sort is roughly linear, however,
623: a guess at the total timing for
624: .I inv
625: is 250 keys per second.
626: The space used is usually of more importance:
627: the entry file uses four bytes per possible hash (note
628: the
629: .B \-h
630: option),
631: and the tag file around 15-20 bytes per item indexed.
632: Roughly, the posting file contains one item for each key instance
633: and one item for each possible hash code; the items are two bytes
634: long if the tag file is less than 65336 bytes long, and the
635: items are four bytes wide if the tag file is greater than
636: 65536 bytes long.
637: Note that to minimize storage, the hash tables should be
638: over-full;
639: for most of the files indexed in this way, there is no
640: other real choice, since the
641: .I entry
642: file must fit in memory.
643: .PP
644: .B
645: Searching and Retrieving.
646: .R
647: The
648: .I hunt
649: program retrieves items from an index.
650: It combines, as mentioned above, the two parts of phase (B):
651: search and delivery.
652: The reason why it is efficient to combine delivery and search
653: is partly to avoid starting unnecessary processes, and partly
654: because the delivery operation must be a part of the search
655: operation in any case.
656: Because of the hashing, the search part takes place in two stages:
657: first items are retrieved which have the right hash codes associated with them,
658: and then the actual items are inspected to determine false drops, i.e.
659: to determine if anything with the right hash codes doesn't really have the right
660: keys.
661: Since the original item is retrieved to check on false drops,
662: it is efficient to present it immediately, rather than only
663: giving the tag as output and later retrieving the
664: item again.
665: If there were a separate key file, this argument would not apply,
666: but separate key files are not common.
667: .PP
668: Input to
669: .I hunt
670: is taken from the standard input,
671: one query per line.
672: Each query should be in
673: .I "mkey \-s"
674: output format;
675: all lower case, no punctuation.
676: The
677: .I hunt
678: program takes one argument which specifies the base name of the index
679: files to be searched.
680: Only one set of index files can be searched at a time,
681: although many text files may be indexed as a group, of course.
682: If one of the text files has been changed since the index, that file
683: is searched with
684: .I fgrep;
685: this may occasionally slow down the searching, and care should be taken to
686: avoid having many out of date files.
687: The following option arguments are recognized by
688: .I hunt:
689: .TS
690: center;
691: lB lw(4i).
692: \-a T{
693: Give all output; ignore checking for false drops.
694: T}
695: \-C\f2n T{
696: Coordination level
697: .I n;
698: retrieve items with not more than
699: .I n
700: terms of the input missing;
701: default
702: .I C0 ,
703: implying that each search term must be in the output items.
704: T}
705: \-F[yn\f2d\f3\|] T{
706: ``\-Fy'' gives the text of all the items found;
707: ``\-Fn'' suppresses them.
708: ``\-F\f2d\|\f1'' where \f2d\f1\| is an integer
709: gives the text of the first \f2d\f1 items.
710: The default is
711: .I \-Fy.
712: T}
713: \-g T{
714: Do not use
715: .I fgrep
716: to search files changed since the index was made;
717: print an error comment instead.
718: T}
719: \-i \f2string T{
720: Take
721: .I string
722: as input, instead of reading the standard input.
723: T}
724: \-l \f2n T{
725: The maximum length of internal lists of candidate
726: items is
727: .I n;
728: default 1000.
729: T}
730: \-o \f2string T{
731: Put text output (``\-Fy'') in
732: .I string;
733: of use
734: .I only
735: when
736: invoked from another program.
737: T}
738: \-p T{
739: Print hash code frequencies; mostly
740: for use in optimizing hash table sizes.
741: T}
742: \-T[yn\f2d\|\f3] T{
743: ``\-Ty'' gives the tags of the items found;
744: ``\-Tn'' suppresses them.
745: ``\-T\f2d\f1\|'' where \f2d\f1\| is an integer
746: gives the first \f2d\f1 tags.
747: The default is
748: .I \-Tn .
749: T}
750: \-t \f2string T{
751: Put tag output (``\-Ty'') in
752: .I string;
753: of use
754: .I only
755: when invoked from another program.
756: T}
757: .TE
758: .PP
759: The timing of
760: .I hunt
761: is complex.
762: Normally the hash table is overfull, so that there will
763: be many false drops on any single term;
764: but a multi-term query will have few false drops on
765: all terms.
766: Thus if a query is underspecified (one search term)
767: many potential items will be examined and discarded as false
768: drops, wasting time.
769: If the query is overspecified (a dozen search terms)
770: many keys will be examined only to verify that
771: the single item under consideration has that key posted.
772: The variation of search time with number of keys is
773: shown in the table below.
774: Queries of varying length were constructed to retrieve
775: a particular document from the file of references.
776: In the sequence to the left, search terms were chosen so as
777: to select the desired paper as quickly as possible.
778: In the sequence on the right, terms were chosen inefficiently,
779: so that the query did not uniquely select the desired document
780: until four keys had been used.
781: The same document was the target in each case,
782: and the final set of eight keys are also identical; the differences
783: at five, six and seven keys are produced by measurement error, not
784: by the slightly different key lists.
785: .TS
786: center;
787: c s s s5 | c s s s
788: cp8 cp8 cp8 cp8 | cp8 cp8 cp8 cp8
789: cp8 cp8 cp8 cp8 | cp8 cp8 cp8 cp8
790: n n n n | n n n n .
791: Efficient Keys Inefficient Keys
792: No. keys Total drops Retrieved Search time No. keys Total drops Retrieved Search time
793: (incl. false) Documents (seconds) (incl. false) Documents (seconds)
794: 1 15 3 1.27 1 68 55 5.96
795: 2 1 1 0.11 2 29 29 2.72
796: 3 1 1 0.14 3 8 8 0.95
797: 4 1 1 0.17 4 1 1 0.18
798: 5 1 1 0.19 5 1 1 0.21
799: 6 1 1 0.23 6 1 1 0.22
800: 7 1 1 0.27 7 1 1 0.26
801: 8 1 1 0.29 8 1 1 0.29
802: .TE
803: As would be expected, the optimal search is achieved
804: when the query just specifies the answer; however,
805: overspecification is quite cheap.
806: Roughly, the time required by
807: .I hunt
808: can be approximated as
809: 30 milliseconds per search key plus 75 milliseconds
810: per dropped document (whether it is a false drop or
811: a real answer).
812: In general, overspecification can be recommended;
813: it protects the user against additions to the data base
814: which turn previously uniquely-answered queries
815: into ambiguous queries.
816: .PP
817: The careful reader will have noted an enormous discrepancy between these times
818: and the earlier quoted time of around 1.9 seconds for a search. The times
819: here are purely for the search and retrieval: they are measured by
820: running many searches through a single invocation of the
821: .I hunt
822: program alone.
823: The normal retrieval operation involves using the shell to
824: set up a pipeline through
825: .I mkey
826: to
827: .I hunt
828: and starting both processes; this adds a fixed overhead of about 1.7 seconds
829: of processor time
830: to any single search.
831: Furthermore, remember that all these times are processor times:
832: on a typical morning on our \s-2PDP\s0 11/70 system, with about one dozen
833: people logged on,
834: to obtain 1 second of processor time for the search program
835: took between 2 and 12 seconds of real time, with a median of
836: 3.9 seconds and a mean of 4.8 seconds.
837: Thus, although the work involved in a single search may be only
838: 200 milliseconds, after you add the 1.7 seconds of startup processor
839: time
840: and then assume a 4:1 elapsed/processor time
841: ratio, it will be 8 seconds before any response is printed.
842: .NH
843: Selecting and Formatting References for T\s-2ROFF\s0
844: .PP
845: The major application of the retrieval software
846: is
847: .I refer,
848: which is a
849: .I troff
850: preprocessor
851: like
852: .I eqn .
853: .[
854: kernighan cherry acm 1975
855: .]
856: It scans its input looking for items of the form
857: .DS
858: \*.[
859: imprecise citation
860: \*.\^]
861: .DE
862: where an imprecise citation is merely a string
863: of words found in the relevant bibliographic citation.
864: This is translated into a properly formatted reference.
865: If the imprecise citation does not correctly identify
866: a single paper
867: (either
868: selecting no papers or too many) a message is given.
869: The data base of citations searched may be tailored to each
870: system, and individual users may specify their own
871: citation
872: files.
873: On our system, the default data base is accumulated from
874: the publication lists of the members of our organization, plus
875: about half a dozen personal bibliographies that were collected.
876: The present total is about 4300 citations, but this increases steadily.
877: Even now,
878: the data base covers a large fraction of local citations.
879: .PP
880: For example, the reference for the
881: .I eqn
882: paper above was specified as
883: .DS
884: \&\*.\*.\*.
885: \&preprocessor like
886: \&.I eqn.
887: \&.[
888: \&kernighan cherry acm 1975
889: \&.]
890: \&It scans its input looking for items
891: \&\*.\*.\*.
892: .DE
893: This paper was itself printed using
894: .I refer.
895: The above input text was processed by
896: .I refer
897: as well as
898: .I tbl
899: and
900: .I troff
901: by the command
902: .DS
903: .ft I
904: refer memo-file | tbl | troff \-ms
905: .ft R
906: .DE
907: and the reference was automatically translated into a correct
908: citation to the ACM paper on mathematical typesetting.
909: .PP
910: The procedure to use to place a reference in a paper
911: using
912: .I refer
913: is as follows.
914: First, use the
915: .I lookbib
916: command to check that the paper is in the data base
917: and to find out what keys are necessary to retrieve it.
918: This is done by typing
919: .I lookbib
920: and then typing some potential queries until
921: a suitable query is found.
922: For example, had one started to find
923: the
924: .I eqn
925: paper shown above by presenting the query
926: .DS
927: $ lookbib
928: kernighan cherry
929: (EOT)
930: .DE
931: .I lookbib
932: would have found several items; experimentation would quickly
933: have shown that the query given above is adequate.
934: Overspecifying the query is of course harmless.
935: A particularly careful reader may have noticed that ``acm'' does not
936: appear in the printed citation;
937: we have supplemented some of the data base items with common
938: extra keywords, such as common abbreviations for journals
939: or other sources, to aid in searching.
940: .PP
941: If the reference is in the data base, the query
942: that retrieved it can be inserted in the text,
943: between
944: .B \*.[
945: and
946: .B \*.\^]
947: brackets.
948: If it is not in the data base, it can be typed
949: into a private file of references, using the format
950: discussed in the next section, and then
951: the
952: .B \-p
953: option
954: used to search this private file.
955: Such a command might read
956: (if the private references are called
957: .I myfile )
958: .DS
959: .ft 2
960: refer \-p myfile document | tbl | eqn | troff \-ms \*. \*. \*.
961: .ft 1
962: .DE
963: where
964: .I tbl
965: and/or
966: .I eqn
967: could be omitted if not needed.
968: The use
969: of the
970: .I \-ms
971: macros
972: .[
973: lesk typing documents unix gcos
974: .]
975: or some other macro package, however,
976: is essential.
977: .I Refer
978: only generates the data for the references; exact formatting
979: is done by some macro package, and if none is supplied the
980: references will not be printed.
981: .PP
982: By default,
983: the references are numbered sequentially,
984: and
985: the
986: .I \-ms
987: macros format references as footnotes at the bottom of the page.
988: This memorandum is an example of that style.
989: Other possibilities are discussed in section 5 below.
990: .NH
991: Reference Files.
992: .PP
993: A reference file is a set of bibliographic references usable with
994: .I refer.
995: It can be indexed using the software described in section 2
996: for fast searching.
997: What
998: .I refer
999: does is to read the input document stream,
1000: looking for imprecise citation references.
1001: It then searches through reference files to find
1002: the full citations, and inserts them into the
1003: document.
1004: The format of the full citation is arranged to make it
1005: convenient for a macro package, such as the
1006: .I \-ms
1007: macros, to format the reference
1008: for printing.
1009: Since
1010: the format of the final reference is determined
1011: by the desired style of output,
1012: which is determined by the macros used,
1013: .I refer
1014: avoids forcing any kind of reference appearance.
1015: All it does is define a set of string registers which
1016: contain the basic information about the reference;
1017: and provide a macro call which is expanded by the macro
1018: package to format the reference.
1019: It is the responsibility of the final macro package
1020: to see that the reference is actually printed; if no
1021: macros are used, and the output of
1022: .I refer
1023: fed untranslated to
1024: .I troff,
1025: nothing at all will be printed.
1026: .PP
1027: The strings defined by
1028: .I refer
1029: are taken directly from the files of references, which
1030: are in the following format.
1031: The references should be separated
1032: by blank lines.
1033: Each reference is a sequence of lines beginning with
1034: .B %
1035: and followed
1036: by a key-letter.
1037: The remainder of that line, and successive lines until the next line beginning
1038: with
1039: .B % ,
1040: contain the information specified by the key-letter.
1041: In general,
1042: .I refer
1043: does not interpret the information, but merely presents
1044: it to the macro package for final formatting.
1045: A user with a separate macro package, for example,
1046: can add new key-letters or use the existing ones for other purposes
1047: without bothering
1048: .I refer.
1049: .PP
1050: The meaning of the key-letters given below, in particular,
1051: is that assigned by the
1052: .I \-ms
1053: macros.
1054: Not all information, obviously, is used with each citation.
1055: For example, if a document is both an internal memorandum and a journal article,
1056: the macros ignore the memorandum version and cite only the journal article.
1057: Some kinds of information are not used at all in printing the reference;
1058: if a user does not like finding references by specifying title
1059: or author keywords, and prefers to add specific keywords to the
1060: citation, a field is available which is searched but not
1061: printed (\f3K\f1).
1062: .PP
1063: The key letters currently recognized by
1064: .I refer
1065: and
1066: .I \-ms,
1067: with the kind of information implied, are:
1068: .KS
1069: .TS
1070: center;
1071: c c6 c c
1072: c l c l.
1073: Key Information specified Key Information specified
1074: A Author's name N Issue number
1075: B Title of book containing item O Other information
1076: C City of publication P Page(s) of article
1077: D Date R Technical report reference
1078: E Editor of book containing item T Title
1079: G Government (NTIS) ordering number V Volume number
1080: I Issuer (publisher)
1081: J Journal name
1082: K Keys (for searching) X or
1083: L Label Y or
1084: M Memorandum label Z Information not used by \f2refer\f1
1085: .TE
1086: .KE
1087: For example, a sample reference could be
1088: typed as:
1089: .DS
1090: %T Bounds on the Complexity of the Maximal
1091: Common Subsequence Problem
1092: %Z ctr127
1093: %A A. V. Aho
1094: %A D. S. Hirschberg
1095: %A J. D. Ullman
1096: %J J. ACM
1097: %V 23
1098: %N 1
1099: %P 1-12
1100: .\"%M TM 75-1271-7
1101: %M abcd-78
1102: %D Jan. 1976
1103: .DE
1104: Order is irrelevant, except that authors are shown in the order
1105: given. The output of
1106: .I refer
1107: is a stream of string definitions, one
1108: for each of the fields of each reference, as
1109: shown below.
1110: .DS
1111: \*.]-
1112: \*.ds [A authors' names \*.\*.\*.
1113: \*.ds [T title \*.\*.\*.
1114: \*.ds [J journal \*.\*.\*.
1115: \*.\*.\*.
1116: \*.]\|[ type-number
1117: .DE
1118: The special macro
1119: .B \&\*.]\-
1120: precedes the string definitions
1121: and the special macro
1122: .B \*.]\|[
1123: follows.
1124: These are changed from the input
1125: .B \*.[
1126: and
1127: .B \*.\^]
1128: so that running the same file through
1129: .I refer
1130: again is harmless.
1131: The
1132: .B \*.]\-
1133: macro can be used by the macro package to
1134: initialize.
1135: The
1136: .B \*.]\|[
1137: macro, which should be used
1138: to print the reference, is given an
1139: argument
1140: .I type-number
1141: to indicate the kind of reference, as follows:
1142: .KS
1143: .TS
1144: center;
1145: c c
1146: n l.
1147: Value Kind of reference
1148: 1 Journal article
1149: 2 Book
1150: 3 Article within book
1151: 4 Technical report
1152: 5 Bell Labs technical memorandum
1153: 0 Other
1154: .TE
1155: .KE
1156: The reference is flagged in the text
1157: with the sequence
1158: .DS
1159: \e*\|([\*.number\e*\|(\*.\^]
1160: .DE
1161: where
1162: .I number
1163: is the footnote number.
1164: The strings
1165: .B [\*.
1166: and
1167: .B \*.\^]
1168: should be used by the macro package
1169: to format the reference flag in the text.
1170: These strings can be replaced for a particular
1171: footnote, as described in section 5.
1172: The footnote number (or other signal) is available
1173: to the reference macro
1174: .B \*.]\|[
1175: as the
1176: string register
1177: .B [F .
1178: .PP
1179: In some cases users wish to suspend the searching, and merely
1180: use the reference macro formatting.
1181: That is, the user doesn't want to provide a search key
1182: between
1183: .B \*.[
1184: and
1185: .B \*.\^]
1186: brackets, but merely
1187: the reference lines for the appropriate document.
1188: Alternatively, the user
1189: can wish
1190: to add a few fields to those in the reference
1191: as in the standard file, or
1192: override some fields.
1193: Altering or replacing fields, or supplying whole references, is easily done
1194: by inserting lines beginning
1195: with
1196: .B % ;
1197: any such line is taken as
1198: direct input to the reference
1199: processor rather than keys to be searched.
1200: Thus
1201: .DS
1202: \*.[
1203: key1 key2 key3 \*.\*.\*.
1204: %Q New format item
1205: %R Override report name
1206: \*.\^]
1207: .DE
1208: makes the indicated changes to the result of searching for
1209: the keys.
1210: All of the search keys must be given before the first
1211: \f3%\f1 line.
1212: .PP
1213: If no search keys are provided, an entire citation can
1214: be provided in-line in the text.
1215: For example, if the
1216: .I eqn
1217: paper citation were to be inserted in
1218: this way, rather than by searching for it in the data base,
1219: the input would read
1220: .DS
1221: \&\*.\*.\*.
1222: \&preprocessor like
1223: \&.I eqn.
1224: \&.[
1225: \&%A B. W. Kernighan
1226: \&%A L. L. Cherry
1227: \&%T A System for Typesetting Mathematics
1228: \&%J Comm. ACM
1229: \&%V 18
1230: \&%N 3
1231: \&%P 151-157
1232: \&%D March 1975
1233: \&.]
1234: \&It scans its input looking for items
1235: \&\*.\*.\*.
1236: .DE
1237: This would produce a citation of the same appearance as that
1238: resulting from the file search.
1239: .PP
1240: As shown, fields are normally turned into
1241: .I troff
1242: strings.
1243: Sometimes users would rather have them defined as macros,
1244: so that other
1245: .I troff
1246: commands can be placed into the data.
1247: When this is necessary, simply double the control character
1248: .B %
1249: in the data.
1250: Thus the input
1251: .DS
1252: \&.[
1253: %V 23
1254: %%M
1255: Bell Laboratories,
1256: Murray Hill, N.J. 07974
1257: \&.]
1258: .DE
1259: is processed by
1260: .I refer
1261: into
1262: .DS
1263: \&.ds [V 23
1264: \&.de [M
1265: Bell Laboratories,
1266: Murray Hill, N.J. 07974
1267: \&..
1268: .DE
1269: The information after
1270: .B %%M
1271: is defined as a macro to be invoked by
1272: .B .[M
1273: while the information after
1274: .B %V
1275: is turned into a string to be invoked by
1276: .B \e\(**([V .
1277: At present
1278: .I \-ms
1279: expects all information as strings.
1280: .NH
1281: Collecting References and other Refer Options
1282: .PP
1283: Normally, the combination of
1284: .I refer
1285: and
1286: .I \-ms
1287: formats output as
1288: .I troff
1289: footnotes which are consecutively numbered and placed
1290: at the bottom of the page. However,
1291: options exist to
1292: place the references at the end; to arrange references alphabetically
1293: by senior author; and to indicate references by strings in the text of the form
1294: [Name1975a]
1295: rather than by number.
1296: Whenever references are not placed at the bottom of a page
1297: identical references are coalesced.
1298: .PP
1299: For example, the
1300: .B \-e
1301: option to
1302: .I refer
1303: specifies that references are to be collected; in this case
1304: they are output whenever the sequence
1305: .DS
1306: \*.[
1307: $LIST$
1308: \*.\^]
1309: .DE
1310: is encountered.
1311: Thus, to place references at the end of a paper, the user would run
1312: .I refer
1313: with the
1314: .I \-e
1315: option and place the above $LIST$ commands after the last
1316: line of the text.
1317: .I Refer
1318: will then move all the references to that point.
1319: To aid in formatting the collected references,
1320: .I refer
1321: writes the references preceded by the line
1322: .DS
1323: .B .]<
1324: .DE
1325: and
1326: followed by the line
1327: .DS
1328: .B .]>
1329: .DE
1330: to invoke special macros before and after the references.
1331: .PP
1332: Another possible option to
1333: .I refer
1334: is the
1335: .B \-s
1336: option to specify
1337: sorting of references. The default,
1338: of course, is to list references in the order presented.
1339: The
1340: .B \-s
1341: option implies the
1342: .B \-e
1343: option, and thus requires
1344: a
1345: .DS
1346: \*.[
1347: $LIST$
1348: \*.\^]
1349: .DE
1350: entry to call out the reference list.
1351: The
1352: .B \-s
1353: option may be followed by a string of letters, numbers, and `+' signs indicating how
1354: the references are to be sorted.
1355: The sort is done using the fields whose key-letters are
1356: in the string as sorting keys; the numbers indicate how many
1357: of the fields are to be considered, with `+'
1358: taken as a large number.
1359: Thus the default is
1360: .B \-sAD
1361: meaning ``Sort on senior author, then date.'' To
1362: sort on all authors and then title, specify
1363: .B \-sA+T .
1364: And to sort on two authors and then the journal,
1365: write
1366: .B \-sA2J .
1367: .PP
1368: Other options to
1369: .I refer
1370: change the signal or label inserted in the text for each reference.
1371: Normally these are just sequential numbers,
1372: and their exact placement (within brackets, as superscripts, etc.) is determined
1373: by the macro package.
1374: The
1375: .B \-l
1376: option replaces reference numbers by
1377: strings composed of the senior author's last name, the date,
1378: and a disambiguating letter.
1379: If a number follows the
1380: .B l
1381: as in
1382: .B \-l3
1383: only that many letters of the last name are used
1384: in the label string.
1385: To abbreviate the date as well the form
1386: \f3-l\f2m,n\f1
1387: shortens the last name to the
1388: first
1389: .I m
1390: letters and the date to the
1391: last
1392: .I n
1393: digits.
1394: For example, the option
1395: .B \-l3,2
1396: would refer to the
1397: .I eqn
1398: paper (reference 3) by the signal
1399: .I Ker75a ,
1400: since it is the first cited reference by Kernighan in 1975.
1401: .PP
1402: A user wishing to specify particular labels for
1403: a private bibliography may use the
1404: .B \-k
1405: option.
1406: Specifying
1407: \f3\-k\f2x\f1
1408: causes the field \f2x\f1 to be used as a label.
1409: The default is \f3L\f1.
1410: If this field ends in \f3\-\f1, that character
1411: is replaced by a sequence letter; otherwise the field
1412: is used exactly as given.
1413: .PP
1414: If none of the
1415: .I refer -produced
1416: signals are desired,
1417: the
1418: .B \-b
1419: option entirely suppresses automatic text signals.
1420: .PP
1421: If the user wishes to override the
1422: .I \-ms
1423: treatment of the reference signal (which is normally to
1424: enclose the number in brackets in
1425: .I nroff
1426: and make it a superscript in
1427: .I troff\\| )
1428: this can be done easily.
1429: If the lines
1430: .B \&.[
1431: or
1432: .B \&.]
1433: contain anything following these characters,
1434: the remainders of these lines are used to surround
1435: the reference signal, instead of the default.
1436: Thus, for example, to say ``See reference (2).''
1437: and avoid
1438: ``See reference.\s-3\u2\d\s+3'' the
1439: input might appear
1440: .DS
1441: \&See reference
1442: \&\*.[ (
1443: imprecise citation ...
1444: \&\*.\^])\*.
1445: .DE
1446: Note that blanks are significant in this construction.
1447: If a permanent change is desired in the style of reference
1448: signals, however, it is probably easier to redefine the strings
1449: .B \&[.
1450: and
1451: .B \&.]
1452: (which are used to bracket each signal)
1453: than to change each citation.
1454: .PP
1455: Although normally
1456: .I refer
1457: limits itself to retrieving the data for the reference,
1458: and leaves to a macro package the job of arranging that
1459: data as required by the local format, there are two
1460: special options for rearrangements that can not be
1461: done by macro packages.
1462: The
1463: .B \-c
1464: option puts fields into all upper case
1465: (C\s-2APS\s+2-S\s-2MALL\s+2 C\s-2APS\s+2
1466: in
1467: .I troff
1468: output).
1469: The key-letters indicated what information is to be translated
1470: to upper case follow the
1471: .B c ,
1472: so that
1473: .B \-cAJ
1474: means that authors' names and journals are to be in caps.
1475: The
1476: .B \-a
1477: option writes the names of authors last name first, that is
1478: .I "A. D. Hall, Jr."
1479: is written as
1480: .I "Hall, A. D. Jr" .
1481: The citation form of
1482: the
1483: .I "Journal of the ACM" ,
1484: for example, would require
1485: both
1486: .B \-cA
1487: and
1488: .B \-a
1489: options.
1490: This produces authors' names in the style
1491: .I
1492: K\s-2ERNIGHAN\s0, B. W. \s-2AND\s0 C\s-2HERRY\s0, L. L.\&
1493: .R
1494: for the previous example.
1495: The
1496: .B \-a
1497: option may be followed by a number to indicate how many
1498: author names should be reversed;
1499: .B \-a1
1500: (without any
1501: .B \-c
1502: option)
1503: would produce
1504: .I
1505: Kernighan, B. W. and L. L. Cherry,
1506: .R
1507: for example.
1508: .PP
1509: Finally, there is also the previously-mentioned
1510: .B \-p
1511: option to let the user specify
1512: a private file of references to be searched before the public files.
1513: Note that
1514: .I refer
1515: does not insist on a previously made index for these files.
1516: If a file is named which contains reference
1517: data but is not indexed, it will be searched
1518: (more slowly)
1519: by
1520: .I refer
1521: using
1522: .I fgrep.
1523: In this way
1524: it is easy for users to keep small files of
1525: new references, which can later be added to the
1526: public data bases.
1527: .SG MH-1274-MEL-\s8UNIX\s0
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.