Annotation of 43BSDReno/usr.bin/grep/README/pep4grep.doc2, revision 1.1

1.1     ! root        1: >From postnews Tue Mar 18 18:05:22 1986
        !             2: Subject: More Pep for Boyer-Moore Grep (part 2 of 2)
        !             3: Newsgroups: net.unix
        !             4: 
        !             5: #  "Gratiano speaks an infinite deal of nothing, more than any man in all
        !             6:    of Venice.  His reasons are as two grains of wheat hid in two bushels of
        !             7:    chaff:  you shall seek all day ere you find them, they are not worth
        !             8:    the search."  -- Shakespeare, Merchant of Venice
        !             9: 
        !            10: ... or, part 2, "Reach out and Boyer-Moore Egrep Someone"
        !            11: 
        !            12:      Maybe you never use 'grep'.  Then ignore this.  But if you do, why not
        !            13: use the best algorithm?  Serious addicts know that for unstructured yet
        !            14: stable text, B-trees are used for speed, or something like Lesk's nifty
        !            15: (and unavailable) 'grab' suite for inverted files are ways to go.  Barring file
        !            16: inversion daemons for netnews and other ephemera, we are limited to the
        !            17: present improvements.
        !            18: 
        !            19:      Proper skeptics should question why a nearly I/O-bound program (but
        !            20: not for any CPU with less than the power of several VAX MIPS, alas) should
        !            21: be made more so.  The question was posed in B & M's classic 1978 CACM
        !            22: paper -- the answer then was to free up more CPU cycles for timesharing.
        !            23: Now, our motivations are more mundane (we won't have desktop 5 MIP machines
        !            24: for another year), but not only that, we've discovered that the Cray 2's
        !            25: standard 'egrep' is also very anemic, performing 8-12 times as worse as ours
        !            26: on simple patterns.  For shame, especially since hearing of the rumor that
        !            27: certain group theorists have a search application ready for testing.
        !            28: Boyer-Moore could fill in until a Cray vectorizing C compiler shows up.
        !            29: Sheer speed for machines whose filesystems are cached in memory is nice too.
        !            30: 
        !            31:      A quick-and-dirty rundown of the debts to which the new hybrid pays
        !            32: now follows.
        !            33: 
        !            34:        Thompson, K. T. (CACM, November 1968):
        !            35:                Regular Expression Search Algorithm.  As usual, obvious
        !            36:                once you understand it.  The current 'egrep'.  Still
        !            37:                useful as a base.  Abstracted by Aho/Ullman as Algorithm
        !            38:                9.1 in Design and Analysis of Computer Algorithms.
        !            39: 
        !            40:        Boyer/Moore:
        !            41:                Not quite pre-Unix.  Oh well.  Modern designers should
        !            42:                know better now, if they want their stuff to get out there.
        !            43:                By the way, I haven't used delta2 (or 1) since the O(mn) case
        !            44:                case doesn't come up too often.  Sure Knuth stood on his head
        !            45:                to better the linearity, but his proof had a bug in it until
        !            46:                the 1980 SIAM J. Comput. retraction.  Would you want to code
        !            47:                something that even Knuth trips up on?
        !            48: 
        !            49:                Now to assuage nagging feelings that geneticists might want
        !            50:                to search entire libraries of 9000-unit nucleotide protein
        !            51:                sequences for ((AGCA|TTGCA).*TGC)|AGCT)?T?A+ or some nonsense
        !            52:                which MIGHT be nonlinear, you would want delta2.  So convince
        !            53:                someone to do the Galil/Apostolico/Giancarlo 2n comparison
        !            54:                worst case stuff.  See egrep.c for reference.
        !            55:                
        !            56:        Gosper, W. (HAKMEM 1972):
        !            57:                Gosper didn't get around to the Thompson-like machine until
        !            58:                1972 with HAKMEM.  His PDP 10 code is nevertheless valiant.
        !            59:                He is also (barely) credited with conceiving the backwards
        !            60:                match idea independently.  Where is he now?
        !            61:                
        !            62:        Morris/Pratt:
        !            63:                Nice guys, but for this purpose, has-beens.
        !            64:                Neat to see a hacker's triumph bury some theory.
        !            65: 
        !            66:        Horspool (Software Practice & Experience, 1980):
        !            67:                Now here's a Canadian after the heart of things
        !            68:                (perfect hashing, text compression, NP-complete
        !            69:                code generation probs., etc.)  Did some Amdahl
        !            70:                timings to show that delta2 is not so hot.
        !            71:                Knows about Search For Least Frequent Character First,
        !            72:                which is useful for short patterns. 
        !            73: 
        !            74:        {,e,f}grep man page:
        !            75:                The laughable bugnote "but we do not know a single algorithm
        !            76:                that spans a wide enough range of space-time tradeoffs"
        !            77:                certainly presumes that there is no such thing as switching
        !            78:                logic.  How the 'grep' family got into a multiple-version
        !            79:                mess is probably a Guy Harris story; 'egrep' looks like the
        !            80:                winner, as its functionality is pretty much a superset of
        !            81:                the other two.  The K & P teaser (p. 105) offers hope for
        !            82:                unification, but we see no difference with extant V8 code.
        !            83: 
        !            84:      "Not cited in the text" -- the sexy randomized Karp/Rabin string searcher
        !            85: (Sedgewick, Algorithms, or Karp's Turing Award Lecture), and the ribald classic
        !            86: Time Warps, String Edits, and Macromolecules -- The Theory and Practice
        !            87: of Sequence Comparison (Kruskal & Sankoff).  Inquire within.
        !            88: Thanks for your patience,
        !            89: 
        !            90:      James A. Woods (ames!jaw)
        !            91:      NASA Ames Research Center
        !            92: 
        !            93: P.S.
        !            94:      Current applications for Boyer-Moore code include modification of 
        !            95: 'fastfind' for true speed, as well as substring search for 'grab', both
        !            96: benefiting from BM-style search thru incrementally-compressed files/indices.
        !            97: 

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.