|
|
1.1 ! root 1: >From postnews Tue Mar 18 18:04:08 1986 ! 2: Subject: More Pep for Boyer-Moore Grep (part 1 of 2) ! 3: Newsgroups: net.unix ! 4: ! 5: # The chief defect of Henry King ! 6: Was chewing little bits of string. ! 7: ! 8: -- Hilaire Belloc, Cautionary Tales [1907] ! 9: ! 10: # Attempt the end, and never stand to doubt ! 11: Nothing's so hard but search will find it out. ! 12: ! 13: -- Robert Herrick, Hesperides [1648] ! 14: ! 15: The world does not need another 'grep' variant. And so, what is this ! 16: we offer? On the surface, the exact same 'egrep' actually, but underneath, ! 17: a swift Boyer-Moore hybrid, in C, which can beat assembler versions utilizing ! 18: microcoded string search instructions. The offering, designed in the ! 19: Kernighanian sense to utilize the existing 'egrep' when it must, also ! 20: makes use of Mr. Henry Spencer's regexp(3) functions in an unusual way. ! 21: For the edification of those without on-line access to system source code, ! 22: the vendor-supplied 'egrep' is left in a pristine state. ! 23: ! 24: With code now wending its way to mod.sources, we obtain the following ! 25: results. Times (in seconds) are all measured on a VAX 11/750 system running ! 26: BSD 4.2 on Fujitsu Eagles, although our 'egrep' has been run on the Sun 2, ! 27: V7 Unix/PDP 11, Vaxen configured with System V, and, for added effect, the ! 28: NASA Ames Cray 2. ! 29: ! 30: 200K bytes user sys notes ! 31: ! 32: (new) egrep astrian /usr/dict/words 0.4 0.5 implementation by "jaw" ! 33: match " " 0.5 0.5 VAX-only (Waterloo) ! 34: bm " " 1.1 0.6 Peter Bain's version 2 ! 35: (old) egrep " " 5.6 1.7 standard ! 36: ! 37: [note: the output here is the single word "Zoroastrian".] ! 38: ! 39: Aha, you quip -- this is all very fine for the 99 and 44/100's percent ! 40: metacharacter-free world, but what about timing for shorter strings, character ! 41: folding, as well as for the more interesting universe of extended regular ! 42: expressions? Samples forthwith. (Egrep below refers to the new one, times for ! 43: the /usr/bin code being about the same as above on most any pattern.) ! 44: ! 45: egrep zurich 0.4 0.5 0 words output ! 46: egrep -i zuRich 0.4 0.5 1 ! 47: egrep -i zeus 0.6 0.6 1 ! 48: egrep -i zen 0.7 0.6 11 ! 49: bm zen 2.2 0.6 10 ! 50: egrep ZZ 0.8 0.6 0 ! 51: bm ZZ 3.0 0.7 0 ! 52: egrep -c Z 1.5 0.6 19 ! 53: bm -c Z 5.9 0.7 19 ! 54: ! 55: Admittedly, most people (or programs) don't search for single characters, ! 56: where Boyer-Moore is a bit slow, but it's important for the layered regular ! 57: expression approach described herein. We might point out from the above that ! 58: the popular "fold" option crippled by 'bm' costs little; it's only a slight ! 59: adjustment of the precomputed "delta" table as well as a single character ! 60: array reference in a secondary loop. Why has Bain claimed complexity for this? ! 61: Also, the times show that the inner loop chosen for our code (modeled after ! 62: the original speedup done by Boyer-Moore for the PDP 10) consistently betters ! 63: the "blindingly fast" version by a factor of two to three. The tipoff was ! 64: from previous paper studies (esp. Horspool, see header notes in code) noting ! 65: that the algorithm should, when implemented efficiently, best typical microcode. ! 66: Now it does. ! 67: ! 68: while ( (k += delta0 ( *k )) < strend ) ! 69: ; /* over 80% of time spent here */ ! 70: ! 71: is the key (modulo precomputation tricks), and takes but three or four ! 72: instructions on most machines. ! 73: ! 74: Basic method for regular expressions: ! 75: ! 76: (1) isolate the longest metacharacter-free pattern string via the ! 77: "regmust" field provided by H. Spencer's regcomp() routine. ! 78: ! 79: (Non-kosher, but worth not re-inventing the wheel. ! 80: v8 folks just might have to reverse-engineer Spencer's ! 81: reverse-engineering to provide equivalent functionality. ! 82: You see, there are many more sites running his code than v8. ! 83: Besides, we enjoy using regexpr technology on itself. ! 84: ! 85: (2) for "short" input, submatching lines are passed to regexec(). ! 86: ! 87: (3) for "long" input, start up a standard 'egrep' process via ! 88: popen() or equivalent. Why not just use regexec()? Unfortunately ! 89: for our application, Spencer's otherwise admirable finite-state ! 90: automaton exhibits poor performance for complex expressions. ! 91: Setting a threshold on input length, though not perfect, helps. ! 92: If pipes on Unix were free, we'd use this way exclusively. ! 93: Until then, we buy happiness for those who might ! 94: ! 95: egrep stuff /usr/spool/news/net/unix/* ! 96: ! 97: or on other directories full of short files. ! 98: ! 99: [note: the details of (3) have changed in the re-release -- see pep4grep.doc3] ! 100: ! 101: So, ! 102: [new] egrep -i 'hoe.*g' words 1.2 1.1 ! 103: {shoestring,Shoenberg} ! 104: [new] egrep '(a|b).*zz.*[od]$' words 1.5 1.1 ! 105: {blizzard,buzzword,palazzo} ! 106: [old] egrep 6.3 1.4 ! 107: but, ! 108: {new,old} egrep -c 'first|second' similar times (no isolate) ! 109: ! 110: Again, we stress that given the different nature of the simulations of the two ! 111: nondeterministic reg. expr. state-machines (one functionless), cases can be ! 112: "cooked" to show things in a bad light, so a hybrid is warranted. ! 113: We can generally do better incorporating the Boyer-Moore algorithm directly ! 114: into the AT&T code. For the last example, the abstraction ! 115: ! 116: (egrep first words &; egrep second words) | sort -u | wc ! 117: ! 118: ideally would work better on a parallel machine, but if you're expecting ! 119: something as amazing in this draft as, say, Morwen B. Thistlethwaite's 52-move ! 120: Rubik's Cube solution, you're in the wrong place. ! 121: ! 122: [note: but see pep4grep.doc3 -- now [ef]?grep handles some parallelism fast] ! 123: ! 124: About options -- the SVID ones are supported (-c, -l, bonus -i for BSD), ! 125: and -s and -h works as for BSD and v8. Note: the 'egrep' here just hands off ! 126: patterns to the old code for things like -n, -b, -v, and multiple patterns. ! 127: As a bone to throw to the enemies of the cat-v school, there is a -1 flag ! 128: (halt after printing first match), but we don't talk about it much. ! 129: Multiple patterns can done ala 'bm' but laziness in the presence of lack of ! 130: knowledge of where 'fgrep' wins has prevailed for version 1. ! 131: ! 132: Personally I feel that adapting ("internationalizing") the 'egrep' effort ! 133: for two-byte Kanji is FAR more important than tweeking options or tradeoffs, ! 134: so for you large-alphabet Boyer-Moore algorithm specialists, send ideas ! 135: this way. ! 136: ! 137: Further historical/philosophical comments follow in the sequel. ! 138: ! 139: James A. Woods (ames!jaw) ! 140: NASA Ames Research Center ! 141:
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.