Annotation of 43BSDTahoe/ucb/grep/pep4grep.doc1, revision 1.1

1.1     ! root        1: >From postnews Tue Mar 18 18:04:08 1986
        !             2: Subject: More Pep for Boyer-Moore Grep (part 1 of 2)
        !             3: Newsgroups: net.unix
        !             4: 
        !             5: #  The chief defect of Henry King
        !             6:    Was chewing little bits of string.
        !             7: 
        !             8:        -- Hilaire Belloc, Cautionary Tales [1907]
        !             9: 
        !            10: #  Attempt the end, and never stand to doubt
        !            11:    Nothing's so hard but search will find it out.
        !            12: 
        !            13:        -- Robert Herrick, Hesperides [1648]
        !            14: 
        !            15:      The world does not need another 'grep' variant.  And so, what is this
        !            16: we offer?  On the surface, the exact same 'egrep' actually, but underneath,
        !            17: a swift Boyer-Moore hybrid, in C, which can beat assembler versions utilizing
        !            18: microcoded string search instructions.  The offering, designed in the
        !            19: Kernighanian sense to utilize the existing 'egrep' when it must, also
        !            20: makes use of Mr. Henry Spencer's regexp(3) functions in an unusual way.
        !            21: For the edification of those without on-line access to system source code,
        !            22: the vendor-supplied 'egrep' is left in a pristine state.
        !            23: 
        !            24:      With code now wending its way to mod.sources, we obtain the following
        !            25: results.  Times (in seconds) are all measured on a VAX 11/750 system running
        !            26: BSD 4.2 on Fujitsu Eagles, although our 'egrep' has been run on the Sun 2,
        !            27: V7 Unix/PDP 11, Vaxen configured with System V, and, for added effect, the
        !            28: NASA Ames Cray 2.
        !            29: 
        !            30:                        200K bytes       user   sys     notes
        !            31: 
        !            32:   (new) egrep  astrian /usr/dict/words  0.4    0.5    implementation by "jaw"
        !            33:        match     "           "          0.5    0.5    VAX-only (Waterloo)
        !            34:        bm        "           "          1.1    0.6    Peter Bain's version 2
        !            35:   (old) egrep     "           "         5.6    1.7    standard 
        !            36: 
        !            37: [note:  the output here is the single word "Zoroastrian".]
        !            38: 
        !            39:      Aha, you quip -- this is all very fine for the 99 and 44/100's percent
        !            40: metacharacter-free world, but what about timing for shorter strings, character
        !            41: folding, as well as for the more interesting universe of extended regular 
        !            42: expressions?  Samples forthwith.  (Egrep below refers to the new one, times for
        !            43: the /usr/bin code being about the same as above on most any pattern.)
        !            44: 
        !            45:        egrep    zurich         0.4  0.5        0 words output
        !            46:        egrep -i zuRich         0.4  0.5        1 
        !            47:        egrep -i zeus           0.6  0.6        1
        !            48:        egrep -i zen            0.7  0.6        11
        !            49:        bm       zen            2.2  0.6        10
        !            50:        egrep    ZZ             0.8  0.6        0
        !            51:        bm       ZZ             3.0  0.7        0
        !            52:        egrep -c Z              1.5  0.6        19
        !            53:        bm -c    Z              5.9  0.7        19
        !            54: 
        !            55: Admittedly, most people (or programs) don't search for single characters,
        !            56: where Boyer-Moore is a bit slow, but it's important for the layered regular
        !            57: expression approach described herein.  We might point out from the above that
        !            58: the popular "fold" option crippled by 'bm' costs little; it's only a slight
        !            59: adjustment of the precomputed "delta" table as well as a single character
        !            60: array reference in a secondary loop.  Why has Bain claimed complexity for this?
        !            61: Also, the times show that the inner loop chosen for our code (modeled after
        !            62: the original speedup done by Boyer-Moore for the PDP 10) consistently betters
        !            63: the "blindingly fast" version by a factor of two to three.  The tipoff was
        !            64: from previous paper studies (esp. Horspool, see header notes in code) noting
        !            65: that the algorithm should, when implemented efficiently, best typical microcode.
        !            66: Now it does. 
        !            67: 
        !            68:        while ( (k += delta0 ( *k )) < strend )
        !            69:                ;               /* over 80% of time spent here */
        !            70: 
        !            71: is the key (modulo precomputation tricks), and takes but three or four
        !            72: instructions on most machines.
        !            73: 
        !            74:      Basic method for regular expressions:
        !            75: 
        !            76:        (1) isolate the longest metacharacter-free pattern string via the
        !            77:            "regmust" field provided by H. Spencer's regcomp() routine.
        !            78: 
        !            79:            (Non-kosher, but worth not re-inventing the wheel.
        !            80:            v8 folks just might have to reverse-engineer Spencer's
        !            81:            reverse-engineering to provide equivalent functionality.
        !            82:            You see, there are many more sites running his code than v8.
        !            83:            Besides, we enjoy using regexpr technology on itself.
        !            84: 
        !            85:        (2) for "short" input, submatching lines are passed to regexec().
        !            86: 
        !            87:        (3) for "long" input, start up a standard 'egrep' process via
        !            88:            popen() or equivalent.  Why not just use regexec()?  Unfortunately
        !            89:            for our application, Spencer's otherwise admirable finite-state
        !            90:            automaton exhibits poor performance for complex expressions.
        !            91:            Setting a threshold on input length, though not perfect, helps.
        !            92:            If pipes on Unix were free, we'd use this way exclusively.
        !            93:            Until then, we buy happiness for those who might
        !            94: 
        !            95:                        egrep stuff /usr/spool/news/net/unix/*
        !            96: 
        !            97:            or on other directories full of short files.
        !            98: 
        !            99: [note: the details of (3) have changed in the re-release -- see pep4grep.doc3]
        !           100: 
        !           101: So,
        !           102:        [new] egrep -i 'hoe.*g' words           1.2  1.1
        !           103:                                                {shoestring,Shoenberg}
        !           104:        [new] egrep '(a|b).*zz.*[od]$'  words   1.5  1.1
        !           105:                                                {blizzard,buzzword,palazzo}
        !           106:        [old] egrep                             6.3  1.4
        !           107: but,
        !           108:        {new,old} egrep -c 'first|second'       similar times (no isolate)
        !           109: 
        !           110: Again, we stress that given the different nature of the simulations of the two
        !           111: nondeterministic reg. expr. state-machines (one functionless), cases can be
        !           112: "cooked" to show things in a bad light, so a hybrid is warranted.
        !           113: We can generally do better incorporating the Boyer-Moore algorithm directly
        !           114: into the AT&T code.  For the last example, the abstraction
        !           115: 
        !           116:        (egrep first words &; egrep second words) | sort -u | wc
        !           117: 
        !           118: ideally would work better on a parallel machine, but if you're expecting
        !           119: something as amazing in this draft as, say, Morwen B. Thistlethwaite's 52-move
        !           120: Rubik's Cube solution, you're in the wrong place.
        !           121: 
        !           122: [note: but see pep4grep.doc3 -- now [ef]?grep handles some parallelism fast]
        !           123: 
        !           124:      About options -- the SVID ones are supported (-c, -l, bonus -i for BSD),
        !           125: and -s and -h works as for BSD and v8.  Note: the 'egrep' here just hands off
        !           126: patterns to the old code for things like -n, -b, -v, and multiple patterns.
        !           127: As a bone to throw to the enemies of the cat-v school, there is a -1 flag
        !           128: (halt after printing first match), but we don't talk about it much.
        !           129: Multiple patterns can done ala 'bm' but laziness in the presence of lack of
        !           130: knowledge of where 'fgrep' wins has prevailed for version 1.
        !           131: 
        !           132:      Personally I feel that adapting ("internationalizing") the 'egrep' effort
        !           133: for two-byte Kanji is FAR more important than tweeking options or tradeoffs,
        !           134: so for you large-alphabet Boyer-Moore algorithm specialists, send ideas
        !           135: this way.
        !           136:      
        !           137:      Further historical/philosophical comments follow in the sequel.
        !           138: 
        !           139:      James A. Woods (ames!jaw)
        !           140:      NASA Ames Research Center
        !           141: 

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.