Annotation of 43BSDReno/usr.bin/grep/README/pep4grep.doc2, revision 1.1.1.1

1.1       root        1: >From postnews Tue Mar 18 18:05:22 1986
                      2: Subject: More Pep for Boyer-Moore Grep (part 2 of 2)
                      3: Newsgroups: net.unix
                      4: 
                      5: #  "Gratiano speaks an infinite deal of nothing, more than any man in all
                      6:    of Venice.  His reasons are as two grains of wheat hid in two bushels of
                      7:    chaff:  you shall seek all day ere you find them, they are not worth
                      8:    the search."  -- Shakespeare, Merchant of Venice
                      9: 
                     10: ... or, part 2, "Reach out and Boyer-Moore Egrep Someone"
                     11: 
                     12:      Maybe you never use 'grep'.  Then ignore this.  But if you do, why not
                     13: use the best algorithm?  Serious addicts know that for unstructured yet
                     14: stable text, B-trees are used for speed, or something like Lesk's nifty
                     15: (and unavailable) 'grab' suite for inverted files are ways to go.  Barring file
                     16: inversion daemons for netnews and other ephemera, we are limited to the
                     17: present improvements.
                     18: 
                     19:      Proper skeptics should question why a nearly I/O-bound program (but
                     20: not for any CPU with less than the power of several VAX MIPS, alas) should
                     21: be made more so.  The question was posed in B & M's classic 1978 CACM
                     22: paper -- the answer then was to free up more CPU cycles for timesharing.
                     23: Now, our motivations are more mundane (we won't have desktop 5 MIP machines
                     24: for another year), but not only that, we've discovered that the Cray 2's
                     25: standard 'egrep' is also very anemic, performing 8-12 times as worse as ours
                     26: on simple patterns.  For shame, especially since hearing of the rumor that
                     27: certain group theorists have a search application ready for testing.
                     28: Boyer-Moore could fill in until a Cray vectorizing C compiler shows up.
                     29: Sheer speed for machines whose filesystems are cached in memory is nice too.
                     30: 
                     31:      A quick-and-dirty rundown of the debts to which the new hybrid pays
                     32: now follows.
                     33: 
                     34:        Thompson, K. T. (CACM, November 1968):
                     35:                Regular Expression Search Algorithm.  As usual, obvious
                     36:                once you understand it.  The current 'egrep'.  Still
                     37:                useful as a base.  Abstracted by Aho/Ullman as Algorithm
                     38:                9.1 in Design and Analysis of Computer Algorithms.
                     39: 
                     40:        Boyer/Moore:
                     41:                Not quite pre-Unix.  Oh well.  Modern designers should
                     42:                know better now, if they want their stuff to get out there.
                     43:                By the way, I haven't used delta2 (or 1) since the O(mn) case
                     44:                case doesn't come up too often.  Sure Knuth stood on his head
                     45:                to better the linearity, but his proof had a bug in it until
                     46:                the 1980 SIAM J. Comput. retraction.  Would you want to code
                     47:                something that even Knuth trips up on?
                     48: 
                     49:                Now to assuage nagging feelings that geneticists might want
                     50:                to search entire libraries of 9000-unit nucleotide protein
                     51:                sequences for ((AGCA|TTGCA).*TGC)|AGCT)?T?A+ or some nonsense
                     52:                which MIGHT be nonlinear, you would want delta2.  So convince
                     53:                someone to do the Galil/Apostolico/Giancarlo 2n comparison
                     54:                worst case stuff.  See egrep.c for reference.
                     55:                
                     56:        Gosper, W. (HAKMEM 1972):
                     57:                Gosper didn't get around to the Thompson-like machine until
                     58:                1972 with HAKMEM.  His PDP 10 code is nevertheless valiant.
                     59:                He is also (barely) credited with conceiving the backwards
                     60:                match idea independently.  Where is he now?
                     61:                
                     62:        Morris/Pratt:
                     63:                Nice guys, but for this purpose, has-beens.
                     64:                Neat to see a hacker's triumph bury some theory.
                     65: 
                     66:        Horspool (Software Practice & Experience, 1980):
                     67:                Now here's a Canadian after the heart of things
                     68:                (perfect hashing, text compression, NP-complete
                     69:                code generation probs., etc.)  Did some Amdahl
                     70:                timings to show that delta2 is not so hot.
                     71:                Knows about Search For Least Frequent Character First,
                     72:                which is useful for short patterns. 
                     73: 
                     74:        {,e,f}grep man page:
                     75:                The laughable bugnote "but we do not know a single algorithm
                     76:                that spans a wide enough range of space-time tradeoffs"
                     77:                certainly presumes that there is no such thing as switching
                     78:                logic.  How the 'grep' family got into a multiple-version
                     79:                mess is probably a Guy Harris story; 'egrep' looks like the
                     80:                winner, as its functionality is pretty much a superset of
                     81:                the other two.  The K & P teaser (p. 105) offers hope for
                     82:                unification, but we see no difference with extant V8 code.
                     83: 
                     84:      "Not cited in the text" -- the sexy randomized Karp/Rabin string searcher
                     85: (Sedgewick, Algorithms, or Karp's Turing Award Lecture), and the ribald classic
                     86: Time Warps, String Edits, and Macromolecules -- The Theory and Practice
                     87: of Sequence Comparison (Kruskal & Sankoff).  Inquire within.
                     88: Thanks for your patience,
                     89: 
                     90:      James A. Woods (ames!jaw)
                     91:      NASA Ames Research Center
                     92: 
                     93: P.S.
                     94:      Current applications for Boyer-Moore code include modification of 
                     95: 'fastfind' for true speed, as well as substring search for 'grab', both
                     96: benefiting from BM-style search thru incrementally-compressed files/indices.
                     97: 

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.