Annotation of 43BSDReno/usr.bin/grep/README/pep4grep.doc1, revision 1.1.1.1

1.1       root        1: >From postnews Tue Mar 18 18:04:08 1986
                      2: Subject: More Pep for Boyer-Moore Grep (part 1 of 2)
                      3: Newsgroups: net.unix
                      4: 
                      5: #  The chief defect of Henry King
                      6:    Was chewing little bits of string.
                      7: 
                      8:        -- Hilaire Belloc, Cautionary Tales [1907]
                      9: 
                     10: #  Attempt the end, and never stand to doubt
                     11:    Nothing's so hard but search will find it out.
                     12: 
                     13:        -- Robert Herrick, Hesperides [1648]
                     14: 
                     15:      The world does not need another 'grep' variant.  And so, what is this
                     16: we offer?  On the surface, the exact same 'egrep' actually, but underneath,
                     17: a swift Boyer-Moore hybrid, in C, which can beat assembler versions utilizing
                     18: microcoded string search instructions.  The offering, designed in the
                     19: Kernighanian sense to utilize the existing 'egrep' when it must, also
                     20: makes use of Mr. Henry Spencer's regexp(3) functions in an unusual way.
                     21: For the edification of those without on-line access to system source code,
                     22: the vendor-supplied 'egrep' is left in a pristine state.
                     23: 
                     24:      With code now wending its way to mod.sources, we obtain the following
                     25: results.  Times (in seconds) are all measured on a VAX 11/750 system running
                     26: BSD 4.2 on Fujitsu Eagles, although our 'egrep' has been run on the Sun 2,
                     27: V7 Unix/PDP 11, Vaxen configured with System V, and, for added effect, the
                     28: NASA Ames Cray 2.
                     29: 
                     30:                        200K bytes       user   sys     notes
                     31: 
                     32:   (new) egrep  astrian /usr/dict/words  0.4    0.5    implementation by "jaw"
                     33:        match     "           "          0.5    0.5    VAX-only (Waterloo)
                     34:        bm        "           "          1.1    0.6    Peter Bain's version 2
                     35:   (old) egrep     "           "         5.6    1.7    standard 
                     36: 
                     37: [note:  the output here is the single word "Zoroastrian".]
                     38: 
                     39:      Aha, you quip -- this is all very fine for the 99 and 44/100's percent
                     40: metacharacter-free world, but what about timing for shorter strings, character
                     41: folding, as well as for the more interesting universe of extended regular 
                     42: expressions?  Samples forthwith.  (Egrep below refers to the new one, times for
                     43: the /usr/bin code being about the same as above on most any pattern.)
                     44: 
                     45:        egrep    zurich         0.4  0.5        0 words output
                     46:        egrep -i zuRich         0.4  0.5        1 
                     47:        egrep -i zeus           0.6  0.6        1
                     48:        egrep -i zen            0.7  0.6        11
                     49:        bm       zen            2.2  0.6        10
                     50:        egrep    ZZ             0.8  0.6        0
                     51:        bm       ZZ             3.0  0.7        0
                     52:        egrep -c Z              1.5  0.6        19
                     53:        bm -c    Z              5.9  0.7        19
                     54: 
                     55: Admittedly, most people (or programs) don't search for single characters,
                     56: where Boyer-Moore is a bit slow, but it's important for the layered regular
                     57: expression approach described herein.  We might point out from the above that
                     58: the popular "fold" option crippled by 'bm' costs little; it's only a slight
                     59: adjustment of the precomputed "delta" table as well as a single character
                     60: array reference in a secondary loop.  Why has Bain claimed complexity for this?
                     61: Also, the times show that the inner loop chosen for our code (modeled after
                     62: the original speedup done by Boyer-Moore for the PDP 10) consistently betters
                     63: the "blindingly fast" version by a factor of two to three.  The tipoff was
                     64: from previous paper studies (esp. Horspool, see header notes in code) noting
                     65: that the algorithm should, when implemented efficiently, best typical microcode.
                     66: Now it does. 
                     67: 
                     68:        while ( (k += delta0 ( *k )) < strend )
                     69:                ;               /* over 80% of time spent here */
                     70: 
                     71: is the key (modulo precomputation tricks), and takes but three or four
                     72: instructions on most machines.
                     73: 
                     74:      Basic method for regular expressions:
                     75: 
                     76:        (1) isolate the longest metacharacter-free pattern string via the
                     77:            "regmust" field provided by H. Spencer's regcomp() routine.
                     78: 
                     79:            (Non-kosher, but worth not re-inventing the wheel.
                     80:            v8 folks just might have to reverse-engineer Spencer's
                     81:            reverse-engineering to provide equivalent functionality.
                     82:            You see, there are many more sites running his code than v8.
                     83:            Besides, we enjoy using regexpr technology on itself.
                     84: 
                     85:        (2) for "short" input, submatching lines are passed to regexec().
                     86: 
                     87:        (3) for "long" input, start up a standard 'egrep' process via
                     88:            popen() or equivalent.  Why not just use regexec()?  Unfortunately
                     89:            for our application, Spencer's otherwise admirable finite-state
                     90:            automaton exhibits poor performance for complex expressions.
                     91:            Setting a threshold on input length, though not perfect, helps.
                     92:            If pipes on Unix were free, we'd use this way exclusively.
                     93:            Until then, we buy happiness for those who might
                     94: 
                     95:                        egrep stuff /usr/spool/news/net/unix/*
                     96: 
                     97:            or on other directories full of short files.
                     98: 
                     99: [note: the details of (3) have changed in the re-release -- see pep4grep.doc3]
                    100: 
                    101: So,
                    102:        [new] egrep -i 'hoe.*g' words           1.2  1.1
                    103:                                                {shoestring,Shoenberg}
                    104:        [new] egrep '(a|b).*zz.*[od]$'  words   1.5  1.1
                    105:                                                {blizzard,buzzword,palazzo}
                    106:        [old] egrep                             6.3  1.4
                    107: but,
                    108:        {new,old} egrep -c 'first|second'       similar times (no isolate)
                    109: 
                    110: Again, we stress that given the different nature of the simulations of the two
                    111: nondeterministic reg. expr. state-machines (one functionless), cases can be
                    112: "cooked" to show things in a bad light, so a hybrid is warranted.
                    113: We can generally do better incorporating the Boyer-Moore algorithm directly
                    114: into the AT&T code.  For the last example, the abstraction
                    115: 
                    116:        (egrep first words &; egrep second words) | sort -u | wc
                    117: 
                    118: ideally would work better on a parallel machine, but if you're expecting
                    119: something as amazing in this draft as, say, Morwen B. Thistlethwaite's 52-move
                    120: Rubik's Cube solution, you're in the wrong place.
                    121: 
                    122: [note: but see pep4grep.doc3 -- now [ef]?grep handles some parallelism fast]
                    123: 
                    124:      About options -- the SVID ones are supported (-c, -l, bonus -i for BSD),
                    125: and -s and -h works as for BSD and v8.  Note: the 'egrep' here just hands off
                    126: patterns to the old code for things like -n, -b, -v, and multiple patterns.
                    127: As a bone to throw to the enemies of the cat-v school, there is a -1 flag
                    128: (halt after printing first match), but we don't talk about it much.
                    129: Multiple patterns can done ala 'bm' but laziness in the presence of lack of
                    130: knowledge of where 'fgrep' wins has prevailed for version 1.
                    131: 
                    132:      Personally I feel that adapting ("internationalizing") the 'egrep' effort
                    133: for two-byte Kanji is FAR more important than tweeking options or tradeoffs,
                    134: so for you large-alphabet Boyer-Moore algorithm specialists, send ideas
                    135: this way.
                    136:      
                    137:      Further historical/philosophical comments follow in the sequel.
                    138: 
                    139:      James A. Woods (ames!jaw)
                    140:      NASA Ames Research Center
                    141: 

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.