|
|
1.1 root 1: # How long a time lies in one little word! -- Shakespeare, Richard II, I, iii
2:
3: # Fine words butter no parsnips. -- Southern proverb
4:
5: [ef]?grep Implementation Changes
6:
7: 'grep' r.e. translation:
8:
9: To buy speed for the novice 'grep' user who deigns not to learn the
10: extended 'egrep' syntax, we translate 'grep' r.e.'s to the 'egrep' superset.
11: It is straightforward enough to surround search patterns meaningful to
12: 'egrep' but not to 'grep'. Odd cases include the -w option, not implemented
13: in standard 'egrep', the defunct -y option, and "tagged expressions", which
14: are done via an exec() of /bin/grep. Tagged exprs, like
15:
16: grep '\(....\).*\1' /usr/dict/words
17:
18: which outputs chaff like "beriberi", "couscous", "hodgepodge", and
19: "lightweight", are weird. The irregularity these exprs lend coupled with
20: a low complexity/utility ratio kept them from being part of 'egrep'.
21: But for this feature, old 'grep' code could be thrown away.
22:
23: 'fgrep' improvement / (partial) unification:
24:
25: In the new release, we trap low-complexity disjunctions such as
26:
27: egrep 'boyer|moore' file
28: or
29: fgrep 'boyer\n
30: moore' file
31:
32: (or with "-f patfile" in place of the pattern) with a method to superimpose
33: the non-terminals within the Boyer/Moore table. When scanning text backwards,
34: other programming tricks short-circuit some tests against the pattern.
35: Sparing further details, which might make for a more formal writeup, it
36: suffices to say that although worst-case complexity here is O(Rn) with string
37: length 'n', and R == r.e. size, average-case for text is still sublinear. E.g.
38:
39: egrep 'silver|orange|purple' # hard-to-rhyme color test in eg.shell
40:
41: looks at ~55000 chars in /usr/dict/words, whereas (three) separate invocations
42: of egrep on the individual color words make the code look at ~40000 bytes per
43: word. Aho/Corrasick's 'fgrep', in contrast, must look at all 200KB in the
44: dictionary. The elegant "trie" construction within "fgrep" excels, however,
45: for medium/large R. An equally ambitious "reverse trie", supplementing our
46: extant "alternation mush", would attenuate worst-case behavior while preserving
47: low R speedup. We save the addition for another day.
48:
49: Since the syntax for [ef]grep is similar, we thought of making egrep
50: hand off to fgrep for sufficiently large metacharacter-free R, as there is no
51: strong reason to make the user conscious of the separate algorithms. Certain
52: technicalities prevent this. For one, we are not willing to invent an 'egrep'
53: option to inform the code to interpret a file of (say a hundred) word
54: alternatives containing some innocent metacharacter, that it is literal
55: 'fgrep' input, rather than a closure-containing 'egrep' pattern which would
56: otherwise make egrep explode. More work could be done here.
57:
58: Our motivation? Is this not all overblown? Perhaps, but now you can
59: build a simple fast "NSA filter", or search for the seven dwarfs at leisure.
60: Besides, the final nail needed to be driven into 'bm/match', which tried
61: to do parallel match, but actually shuffled things out of order during its
62: simplistic block-based scheme. These programs, part of source archive also,
63: are now historical curiosities.
64:
65: Kanji egrep:
66:
67: Copious notes are in README.kanji.mods. The March 1987 Unix Review
68: was indispensable for pointing out the embedded "SS2" Katakana pitfalls.
69: The modularity of our code as a semi-dependent filter was necessary for this
70: exploration, as we have no access to AT&T/Unisoft Kanji code. Again, JUNET
71: or Sigma project people -- please respond with grep war stories or usage notes.
72:
73: Worst-case r.e. handling:
74:
75: The first code release elaborately called upon a function mcilroy()
76: to pipe partial match output to old egrep for tough expressions, whose
77: backtracking might swamp regexp(). Some details of the DFA/NDFA tradeoff
78: were discussed in pep4grep.doc[12]. Due largely to feedback from John Gilmore
79: of the GNU project, the strategy was revised. egrep.c function kernighan()
80: ("let others do the hard part") now usurps calls to costly popen() by invoking
81: exec() on old egrep when necessary. Rough partial match statistics gathered
82: on the fly determine the handoff. You may revise the time reported previously
83: for
84: egrep 'hoe.*g' /usr/dict/words
85:
86: from 1.2 user, 1.1 sys seconds (VAX 11/750, 4.3BSD, Fuji disks) to 0.8u, 0.4s.
87: For those public-spirited souls who really want to build a PD egrep out of
88: what we offer, sans fallback from regexp() to an AT&T /usr/bin/egrep, the
89: slippery test "egrep 'g+h+o' /usr/dict/words" will prove enlightening.
90:
91: Faster -n option:
92:
93: By popular demand. Though Boyer/Moore techniques subvert line numbering,
94: we've made it faster with brute force (loop unrolling helps VAXEN, but not
95: CRISPS). Timing tests for this and other options appear in the eg.shell script.
96:
97: Not so fast:
98:
99: -v, -b, -w, various r.e.'s with no rexexp() "residue"
100:
101: (you'll still have to use the layered "grep host /etc/hosts | grep -w host"
102: for speed.)
103:
104: Other contra-indications for new [ef]?grep:
105:
106: Monster patterns
107:
108: The amazing expressions output by /usr/lib/calendar still beg for
109: the lazy evaluation technique rolled into edition 8 egrep by Prof. Aho of
110: Princeton. Hinted at on p. 105 in Kernighan & Pike, lazy evaluation reduces
111: standard egrep r.e. compile time. Here the possible O(R**2) machine
112: construction cost is eliminated to amortize complexity at run-time and
113: shifted to such only if a bad match actually happens. Whew! Fortunately,
114: this is not so important for simple r.e. fare, where H. Spencer's regexp()
115: works well, but it certainly helps calendar(1).
116:
117: The catch with lazy eval. is that it slows down simple matching (15-20%
118: for /usr/dict/words on VAX), so it hasn't been adopted by System V egrep.
119: Note that our egrep, deferring to the underlying one in /usr/bin, doesn't
120: care much about these hideous beasts -- it just doesn't do better on them.
121: However, [ef]?grep does well by the Kadhafi matcher (eg.shell, again).
122:
123: Long lines, small alphabets
124:
125: Finally, a comment on one rapidly burgeoning application area
126: where new egrep should not be blindly proscribed -- genome sequencing.
127: Though line limits have been raised (to 8192 byte buffers), much of
128: GENBANK has no newlines. The code would need modification for scanning
129: freestyle. Also, locating ACGT sequences with the current "superimposed BMG"
130: over a 4-letter alphabet might actually be worse, but the global homology
131: crowd probably uses a >20 letter protein alphabet (for other reasons).
132: At any rate, genetic string search generally relies on more sophisticated
133: methods such as dynamic programming ala Sankoff/Kruskal.
134:
135: On the other hand, large alphabets such as Kanji probably help
136: performance. As do parallel transfer disks, MACH file mapping, ...
137: Your suggestions welcome.
138:
139: James A. Woods (ames!jaw)
140: NASA Ames Research Center
141:
142: P.S. Preserving author credit, [ef]?grep may be redistributed as you wish.
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.