Annotation of 43BSDReno/usr.bin/grep/README/pep4grep.doc4, revision 1.1

1.1     ! root        1: #  How long a time lies in one little word! -- Shakespeare, Richard II, I, iii
        !             2: 
        !             3: #  Fine words butter no parsnips. -- Southern proverb
        !             4: 
        !             5:                [ef]?grep Implementation Changes
        !             6: 
        !             7: 'grep' r.e. translation:
        !             8: 
        !             9:      To buy speed for the novice 'grep' user who deigns not to learn the
        !            10: extended 'egrep' syntax, we translate 'grep' r.e.'s to the 'egrep' superset.
        !            11: It is straightforward enough to surround search patterns meaningful to
        !            12: 'egrep' but not to 'grep'.  Odd cases include the -w option, not implemented
        !            13: in standard 'egrep', the defunct -y option, and "tagged expressions", which
        !            14: are done via an exec() of /bin/grep.  Tagged exprs, like
        !            15: 
        !            16:        grep '\(....\).*\1' /usr/dict/words
        !            17: 
        !            18: which outputs chaff like "beriberi", "couscous", "hodgepodge", and
        !            19: "lightweight", are weird.  The irregularity these exprs lend coupled with
        !            20: a low complexity/utility ratio kept them from being part of 'egrep'.
        !            21: But for this feature, old 'grep' code could be thrown away.
        !            22: 
        !            23: 'fgrep' improvement / (partial) unification:
        !            24: 
        !            25:      In the new release, we trap low-complexity disjunctions such as
        !            26: 
        !            27:                egrep 'boyer|moore' file
        !            28: or
        !            29:                fgrep 'boyer\n
        !            30:                moore' file
        !            31: 
        !            32: (or with "-f patfile" in place of the pattern) with a method to superimpose
        !            33: the non-terminals within the Boyer/Moore table.  When scanning text backwards,
        !            34: other programming tricks short-circuit some tests against the pattern.
        !            35: Sparing further details, which might make for a more formal writeup, it
        !            36: suffices to say that although worst-case complexity here is O(Rn) with string
        !            37: length 'n', and R == r.e. size, average-case for text is still sublinear.  E.g.
        !            38: 
        !            39:        egrep 'silver|orange|purple'    # hard-to-rhyme color test in eg.shell
        !            40: 
        !            41: looks at ~55000 chars in /usr/dict/words, whereas (three) separate invocations
        !            42: of egrep on the individual color words make the code look at ~40000 bytes per
        !            43: word.  Aho/Corrasick's 'fgrep', in contrast, must look at all 200KB in the
        !            44: dictionary.  The elegant "trie" construction within "fgrep" excels, however,
        !            45: for medium/large R.  An equally ambitious "reverse trie", supplementing our
        !            46: extant "alternation mush", would attenuate worst-case behavior while preserving
        !            47: low R speedup.  We save the addition for another day.
        !            48: 
        !            49:      Since the syntax for [ef]grep is similar, we thought of making egrep
        !            50: hand off to fgrep for sufficiently large metacharacter-free R, as there is no
        !            51: strong reason to make the user conscious of the separate algorithms.  Certain
        !            52: technicalities prevent this.  For one, we are not willing to invent an 'egrep'
        !            53: option to inform the code to interpret a file of (say a hundred) word
        !            54: alternatives containing some innocent metacharacter, that it is literal
        !            55: 'fgrep' input, rather than a closure-containing 'egrep' pattern which would
        !            56: otherwise make egrep explode.  More work could be done here.
        !            57: 
        !            58:      Our motivation?  Is this not all overblown?  Perhaps, but now you can
        !            59: build a simple fast "NSA filter", or search for the seven dwarfs at leisure.
        !            60: Besides, the final nail needed to be driven into 'bm/match', which tried
        !            61: to do parallel match, but actually shuffled things out of order during its
        !            62: simplistic block-based scheme.  These programs, part of source archive also,
        !            63: are now historical curiosities.
        !            64: 
        !            65: Kanji egrep:
        !            66: 
        !            67:      Copious notes are in README.kanji.mods.  The March 1987 Unix Review
        !            68: was indispensable for pointing out the embedded "SS2" Katakana pitfalls.
        !            69: The modularity of our code as a semi-dependent filter was necessary for this
        !            70: exploration, as we have no access to AT&T/Unisoft Kanji code.  Again, JUNET
        !            71: or Sigma project people -- please respond with grep war stories or usage notes.
        !            72: 
        !            73: Worst-case r.e. handling:
        !            74: 
        !            75:      The first code release elaborately called upon a function mcilroy()
        !            76: to pipe partial match output to old egrep for tough expressions, whose
        !            77: backtracking might swamp regexp().  Some details of the DFA/NDFA tradeoff
        !            78: were discussed in pep4grep.doc[12].  Due largely to feedback from John Gilmore
        !            79: of the GNU project, the strategy was revised.  egrep.c function kernighan()
        !            80: ("let others do the hard part") now usurps calls to costly popen() by invoking
        !            81: exec() on old egrep when necessary.  Rough partial match statistics gathered
        !            82: on the fly determine the handoff.  You may revise the time reported previously
        !            83: for
        !            84:        egrep 'hoe.*g' /usr/dict/words
        !            85: 
        !            86: from 1.2 user, 1.1 sys seconds (VAX 11/750, 4.3BSD, Fuji disks) to 0.8u, 0.4s.
        !            87: For those public-spirited souls who really want to build a PD egrep out of
        !            88: what we offer, sans fallback from regexp() to an AT&T /usr/bin/egrep, the
        !            89: slippery test "egrep 'g+h+o' /usr/dict/words" will prove enlightening.
        !            90: 
        !            91: Faster -n option:
        !            92: 
        !            93:      By popular demand.  Though Boyer/Moore techniques subvert line numbering,
        !            94: we've made it faster with brute force (loop unrolling helps VAXEN, but not
        !            95: CRISPS).  Timing tests for this and other options appear in the eg.shell script.
        !            96: 
        !            97: Not so fast:
        !            98: 
        !            99:        -v, -b, -w, various r.e.'s with no rexexp() "residue"
        !           100: 
        !           101: (you'll still have to use the layered "grep host /etc/hosts | grep -w host"
        !           102: for speed.)
        !           103: 
        !           104: Other contra-indications for new [ef]?grep:
        !           105: 
        !           106:        Monster patterns
        !           107: 
        !           108:      The amazing expressions output by /usr/lib/calendar still beg for
        !           109: the lazy evaluation technique rolled into edition 8 egrep by Prof. Aho of
        !           110: Princeton.  Hinted at on p. 105 in Kernighan & Pike, lazy evaluation reduces
        !           111: standard egrep r.e. compile time.  Here the possible O(R**2) machine
        !           112: construction cost is eliminated to amortize complexity at run-time and 
        !           113: shifted to such only if a bad match actually happens.  Whew!  Fortunately,
        !           114: this is not so important for simple r.e. fare, where H. Spencer's regexp()
        !           115: works well, but it certainly helps calendar(1).
        !           116: 
        !           117:      The catch with lazy eval. is that it slows down simple matching (15-20%
        !           118: for /usr/dict/words on VAX), so it hasn't been adopted by System V egrep.
        !           119: Note that our egrep, deferring to the underlying one in /usr/bin, doesn't
        !           120: care much about these hideous beasts -- it just doesn't do better on them.
        !           121: However, [ef]?grep does well by the Kadhafi matcher (eg.shell, again).
        !           122: 
        !           123:        Long lines, small alphabets
        !           124: 
        !           125:      Finally, a comment on one rapidly burgeoning application area
        !           126: where new egrep should not be blindly proscribed -- genome sequencing.
        !           127: Though line limits have been raised (to 8192 byte buffers), much of
        !           128: GENBANK has no newlines.  The code would need modification for scanning
        !           129: freestyle.  Also, locating ACGT sequences with the current "superimposed BMG"
        !           130: over a 4-letter alphabet might actually be worse, but the global homology
        !           131: crowd probably uses a >20 letter protein alphabet (for other reasons).
        !           132: At any rate, genetic string search generally relies on more sophisticated
        !           133: methods such as dynamic programming ala Sankoff/Kruskal.
        !           134: 
        !           135:      On the other hand, large alphabets such as Kanji probably help
        !           136: performance.  As do parallel transfer disks, MACH file mapping, ...
        !           137: Your suggestions welcome.
        !           138: 
        !           139:      James A. Woods (ames!jaw)
        !           140:      NASA Ames Research Center
        !           141: 
        !           142: P.S.  Preserving author credit, [ef]?grep may be redistributed as you wish.

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.