|
|
1.1 ! root 1: # How long a time lies in one little word! -- Shakespeare, Richard II, I, iii ! 2: ! 3: # Fine words butter no parsnips. -- Southern proverb ! 4: ! 5: [ef]?grep Implementation Changes ! 6: ! 7: 'grep' r.e. translation: ! 8: ! 9: To buy speed for the novice 'grep' user who deigns not to learn the ! 10: extended 'egrep' syntax, we translate 'grep' r.e.'s to the 'egrep' superset. ! 11: It is straightforward enough to surround search patterns meaningful to ! 12: 'egrep' but not to 'grep'. Odd cases include the -w option, not implemented ! 13: in standard 'egrep', the defunct -y option, and "tagged expressions", which ! 14: are done via an exec() of /bin/grep. Tagged exprs, like ! 15: ! 16: grep '\(....\).*\1' /usr/dict/words ! 17: ! 18: which outputs chaff like "beriberi", "couscous", "hodgepodge", and ! 19: "lightweight", are weird. The irregularity these exprs lend coupled with ! 20: a low complexity/utility ratio kept them from being part of 'egrep'. ! 21: But for this feature, old 'grep' code could be thrown away. ! 22: ! 23: 'fgrep' improvement / (partial) unification: ! 24: ! 25: In the new release, we trap low-complexity disjunctions such as ! 26: ! 27: egrep 'boyer|moore' file ! 28: or ! 29: fgrep 'boyer\n ! 30: moore' file ! 31: ! 32: (or with "-f patfile" in place of the pattern) with a method to superimpose ! 33: the non-terminals within the Boyer/Moore table. When scanning text backwards, ! 34: other programming tricks short-circuit some tests against the pattern. ! 35: Sparing further details, which might make for a more formal writeup, it ! 36: suffices to say that although worst-case complexity here is O(Rn) with string ! 37: length 'n', and R == r.e. size, average-case for text is still sublinear. E.g. ! 38: ! 39: egrep 'silver|orange|purple' # hard-to-rhyme color test in eg.shell ! 40: ! 41: looks at ~55000 chars in /usr/dict/words, whereas (three) separate invocations ! 42: of egrep on the individual color words make the code look at ~40000 bytes per ! 43: word. Aho/Corrasick's 'fgrep', in contrast, must look at all 200KB in the ! 44: dictionary. The elegant "trie" construction within "fgrep" excels, however, ! 45: for medium/large R. An equally ambitious "reverse trie", supplementing our ! 46: extant "alternation mush", would attenuate worst-case behavior while preserving ! 47: low R speedup. We save the addition for another day. ! 48: ! 49: Since the syntax for [ef]grep is similar, we thought of making egrep ! 50: hand off to fgrep for sufficiently large metacharacter-free R, as there is no ! 51: strong reason to make the user conscious of the separate algorithms. Certain ! 52: technicalities prevent this. For one, we are not willing to invent an 'egrep' ! 53: option to inform the code to interpret a file of (say a hundred) word ! 54: alternatives containing some innocent metacharacter, that it is literal ! 55: 'fgrep' input, rather than a closure-containing 'egrep' pattern which would ! 56: otherwise make egrep explode. More work could be done here. ! 57: ! 58: Our motivation? Is this not all overblown? Perhaps, but now you can ! 59: build a simple fast "NSA filter", or search for the seven dwarfs at leisure. ! 60: Besides, the final nail needed to be driven into 'bm/match', which tried ! 61: to do parallel match, but actually shuffled things out of order during its ! 62: simplistic block-based scheme. These programs, part of source archive also, ! 63: are now historical curiosities. ! 64: ! 65: Kanji egrep: ! 66: ! 67: Copious notes are in README.kanji.mods. The March 1987 Unix Review ! 68: was indispensable for pointing out the embedded "SS2" Katakana pitfalls. ! 69: The modularity of our code as a semi-dependent filter was necessary for this ! 70: exploration, as we have no access to AT&T/Unisoft Kanji code. Again, JUNET ! 71: or Sigma project people -- please respond with grep war stories or usage notes. ! 72: ! 73: Worst-case r.e. handling: ! 74: ! 75: The first code release elaborately called upon a function mcilroy() ! 76: to pipe partial match output to old egrep for tough expressions, whose ! 77: backtracking might swamp regexp(). Some details of the DFA/NDFA tradeoff ! 78: were discussed in pep4grep.doc[12]. Due largely to feedback from John Gilmore ! 79: of the GNU project, the strategy was revised. egrep.c function kernighan() ! 80: ("let others do the hard part") now usurps calls to costly popen() by invoking ! 81: exec() on old egrep when necessary. Rough partial match statistics gathered ! 82: on the fly determine the handoff. You may revise the time reported previously ! 83: for ! 84: egrep 'hoe.*g' /usr/dict/words ! 85: ! 86: from 1.2 user, 1.1 sys seconds (VAX 11/750, 4.3BSD, Fuji disks) to 0.8u, 0.4s. ! 87: For those public-spirited souls who really want to build a PD egrep out of ! 88: what we offer, sans fallback from regexp() to an AT&T /usr/bin/egrep, the ! 89: slippery test "egrep 'g+h+o' /usr/dict/words" will prove enlightening. ! 90: ! 91: Faster -n option: ! 92: ! 93: By popular demand. Though Boyer/Moore techniques subvert line numbering, ! 94: we've made it faster with brute force (loop unrolling helps VAXEN, but not ! 95: CRISPS). Timing tests for this and other options appear in the eg.shell script. ! 96: ! 97: Not so fast: ! 98: ! 99: -v, -b, -w, various r.e.'s with no rexexp() "residue" ! 100: ! 101: (you'll still have to use the layered "grep host /etc/hosts | grep -w host" ! 102: for speed.) ! 103: ! 104: Other contra-indications for new [ef]?grep: ! 105: ! 106: Monster patterns ! 107: ! 108: The amazing expressions output by /usr/lib/calendar still beg for ! 109: the lazy evaluation technique rolled into edition 8 egrep by Prof. Aho of ! 110: Princeton. Hinted at on p. 105 in Kernighan & Pike, lazy evaluation reduces ! 111: standard egrep r.e. compile time. Here the possible O(R**2) machine ! 112: construction cost is eliminated to amortize complexity at run-time and ! 113: shifted to such only if a bad match actually happens. Whew! Fortunately, ! 114: this is not so important for simple r.e. fare, where H. Spencer's regexp() ! 115: works well, but it certainly helps calendar(1). ! 116: ! 117: The catch with lazy eval. is that it slows down simple matching (15-20% ! 118: for /usr/dict/words on VAX), so it hasn't been adopted by System V egrep. ! 119: Note that our egrep, deferring to the underlying one in /usr/bin, doesn't ! 120: care much about these hideous beasts -- it just doesn't do better on them. ! 121: However, [ef]?grep does well by the Kadhafi matcher (eg.shell, again). ! 122: ! 123: Long lines, small alphabets ! 124: ! 125: Finally, a comment on one rapidly burgeoning application area ! 126: where new egrep should not be blindly proscribed -- genome sequencing. ! 127: Though line limits have been raised (to 8192 byte buffers), much of ! 128: GENBANK has no newlines. The code would need modification for scanning ! 129: freestyle. Also, locating ACGT sequences with the current "superimposed BMG" ! 130: over a 4-letter alphabet might actually be worse, but the global homology ! 131: crowd probably uses a >20 letter protein alphabet (for other reasons). ! 132: At any rate, genetic string search generally relies on more sophisticated ! 133: methods such as dynamic programming ala Sankoff/Kruskal. ! 134: ! 135: On the other hand, large alphabets such as Kanji probably help ! 136: performance. As do parallel transfer disks, MACH file mapping, ... ! 137: Your suggestions welcome. ! 138: ! 139: James A. Woods (ames!jaw) ! 140: NASA Ames Research Center ! 141: ! 142: P.S. Preserving author credit, [ef]?grep may be redistributed as you wish.
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.