Annotation of 43BSDReno/usr.bin/grep/README/pep4grep.doc4, revision 1.1.1.1

1.1       root        1: #  How long a time lies in one little word! -- Shakespeare, Richard II, I, iii
                      2: 
                      3: #  Fine words butter no parsnips. -- Southern proverb
                      4: 
                      5:                [ef]?grep Implementation Changes
                      6: 
                      7: 'grep' r.e. translation:
                      8: 
                      9:      To buy speed for the novice 'grep' user who deigns not to learn the
                     10: extended 'egrep' syntax, we translate 'grep' r.e.'s to the 'egrep' superset.
                     11: It is straightforward enough to surround search patterns meaningful to
                     12: 'egrep' but not to 'grep'.  Odd cases include the -w option, not implemented
                     13: in standard 'egrep', the defunct -y option, and "tagged expressions", which
                     14: are done via an exec() of /bin/grep.  Tagged exprs, like
                     15: 
                     16:        grep '\(....\).*\1' /usr/dict/words
                     17: 
                     18: which outputs chaff like "beriberi", "couscous", "hodgepodge", and
                     19: "lightweight", are weird.  The irregularity these exprs lend coupled with
                     20: a low complexity/utility ratio kept them from being part of 'egrep'.
                     21: But for this feature, old 'grep' code could be thrown away.
                     22: 
                     23: 'fgrep' improvement / (partial) unification:
                     24: 
                     25:      In the new release, we trap low-complexity disjunctions such as
                     26: 
                     27:                egrep 'boyer|moore' file
                     28: or
                     29:                fgrep 'boyer\n
                     30:                moore' file
                     31: 
                     32: (or with "-f patfile" in place of the pattern) with a method to superimpose
                     33: the non-terminals within the Boyer/Moore table.  When scanning text backwards,
                     34: other programming tricks short-circuit some tests against the pattern.
                     35: Sparing further details, which might make for a more formal writeup, it
                     36: suffices to say that although worst-case complexity here is O(Rn) with string
                     37: length 'n', and R == r.e. size, average-case for text is still sublinear.  E.g.
                     38: 
                     39:        egrep 'silver|orange|purple'    # hard-to-rhyme color test in eg.shell
                     40: 
                     41: looks at ~55000 chars in /usr/dict/words, whereas (three) separate invocations
                     42: of egrep on the individual color words make the code look at ~40000 bytes per
                     43: word.  Aho/Corrasick's 'fgrep', in contrast, must look at all 200KB in the
                     44: dictionary.  The elegant "trie" construction within "fgrep" excels, however,
                     45: for medium/large R.  An equally ambitious "reverse trie", supplementing our
                     46: extant "alternation mush", would attenuate worst-case behavior while preserving
                     47: low R speedup.  We save the addition for another day.
                     48: 
                     49:      Since the syntax for [ef]grep is similar, we thought of making egrep
                     50: hand off to fgrep for sufficiently large metacharacter-free R, as there is no
                     51: strong reason to make the user conscious of the separate algorithms.  Certain
                     52: technicalities prevent this.  For one, we are not willing to invent an 'egrep'
                     53: option to inform the code to interpret a file of (say a hundred) word
                     54: alternatives containing some innocent metacharacter, that it is literal
                     55: 'fgrep' input, rather than a closure-containing 'egrep' pattern which would
                     56: otherwise make egrep explode.  More work could be done here.
                     57: 
                     58:      Our motivation?  Is this not all overblown?  Perhaps, but now you can
                     59: build a simple fast "NSA filter", or search for the seven dwarfs at leisure.
                     60: Besides, the final nail needed to be driven into 'bm/match', which tried
                     61: to do parallel match, but actually shuffled things out of order during its
                     62: simplistic block-based scheme.  These programs, part of source archive also,
                     63: are now historical curiosities.
                     64: 
                     65: Kanji egrep:
                     66: 
                     67:      Copious notes are in README.kanji.mods.  The March 1987 Unix Review
                     68: was indispensable for pointing out the embedded "SS2" Katakana pitfalls.
                     69: The modularity of our code as a semi-dependent filter was necessary for this
                     70: exploration, as we have no access to AT&T/Unisoft Kanji code.  Again, JUNET
                     71: or Sigma project people -- please respond with grep war stories or usage notes.
                     72: 
                     73: Worst-case r.e. handling:
                     74: 
                     75:      The first code release elaborately called upon a function mcilroy()
                     76: to pipe partial match output to old egrep for tough expressions, whose
                     77: backtracking might swamp regexp().  Some details of the DFA/NDFA tradeoff
                     78: were discussed in pep4grep.doc[12].  Due largely to feedback from John Gilmore
                     79: of the GNU project, the strategy was revised.  egrep.c function kernighan()
                     80: ("let others do the hard part") now usurps calls to costly popen() by invoking
                     81: exec() on old egrep when necessary.  Rough partial match statistics gathered
                     82: on the fly determine the handoff.  You may revise the time reported previously
                     83: for
                     84:        egrep 'hoe.*g' /usr/dict/words
                     85: 
                     86: from 1.2 user, 1.1 sys seconds (VAX 11/750, 4.3BSD, Fuji disks) to 0.8u, 0.4s.
                     87: For those public-spirited souls who really want to build a PD egrep out of
                     88: what we offer, sans fallback from regexp() to an AT&T /usr/bin/egrep, the
                     89: slippery test "egrep 'g+h+o' /usr/dict/words" will prove enlightening.
                     90: 
                     91: Faster -n option:
                     92: 
                     93:      By popular demand.  Though Boyer/Moore techniques subvert line numbering,
                     94: we've made it faster with brute force (loop unrolling helps VAXEN, but not
                     95: CRISPS).  Timing tests for this and other options appear in the eg.shell script.
                     96: 
                     97: Not so fast:
                     98: 
                     99:        -v, -b, -w, various r.e.'s with no rexexp() "residue"
                    100: 
                    101: (you'll still have to use the layered "grep host /etc/hosts | grep -w host"
                    102: for speed.)
                    103: 
                    104: Other contra-indications for new [ef]?grep:
                    105: 
                    106:        Monster patterns
                    107: 
                    108:      The amazing expressions output by /usr/lib/calendar still beg for
                    109: the lazy evaluation technique rolled into edition 8 egrep by Prof. Aho of
                    110: Princeton.  Hinted at on p. 105 in Kernighan & Pike, lazy evaluation reduces
                    111: standard egrep r.e. compile time.  Here the possible O(R**2) machine
                    112: construction cost is eliminated to amortize complexity at run-time and 
                    113: shifted to such only if a bad match actually happens.  Whew!  Fortunately,
                    114: this is not so important for simple r.e. fare, where H. Spencer's regexp()
                    115: works well, but it certainly helps calendar(1).
                    116: 
                    117:      The catch with lazy eval. is that it slows down simple matching (15-20%
                    118: for /usr/dict/words on VAX), so it hasn't been adopted by System V egrep.
                    119: Note that our egrep, deferring to the underlying one in /usr/bin, doesn't
                    120: care much about these hideous beasts -- it just doesn't do better on them.
                    121: However, [ef]?grep does well by the Kadhafi matcher (eg.shell, again).
                    122: 
                    123:        Long lines, small alphabets
                    124: 
                    125:      Finally, a comment on one rapidly burgeoning application area
                    126: where new egrep should not be blindly proscribed -- genome sequencing.
                    127: Though line limits have been raised (to 8192 byte buffers), much of
                    128: GENBANK has no newlines.  The code would need modification for scanning
                    129: freestyle.  Also, locating ACGT sequences with the current "superimposed BMG"
                    130: over a 4-letter alphabet might actually be worse, but the global homology
                    131: crowd probably uses a >20 letter protein alphabet (for other reasons).
                    132: At any rate, genetic string search generally relies on more sophisticated
                    133: methods such as dynamic programming ala Sankoff/Kruskal.
                    134: 
                    135:      On the other hand, large alphabets such as Kanji probably help
                    136: performance.  As do parallel transfer disks, MACH file mapping, ...
                    137: Your suggestions welcome.
                    138: 
                    139:      James A. Woods (ames!jaw)
                    140:      NASA Ames Research Center
                    141: 
                    142: P.S.  Preserving author credit, [ef]?grep may be redistributed as you wish.

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.