|
|
1.1 root 1: >From postnews Tue Mar 18 18:04:08 1986
2: Subject: More Pep for Boyer-Moore Grep (part 1 of 2)
3: Newsgroups: net.unix
4:
5: # The chief defect of Henry King
6: Was chewing little bits of string.
7:
8: -- Hilaire Belloc, Cautionary Tales [1907]
9:
10: # Attempt the end, and never stand to doubt
11: Nothing's so hard but search will find it out.
12:
13: -- Robert Herrick, Hesperides [1648]
14:
15: The world does not need another 'grep' variant. And so, what is this
16: we offer? On the surface, the exact same 'egrep' actually, but underneath,
17: a swift Boyer-Moore hybrid, in C, which can beat assembler versions utilizing
18: microcoded string search instructions. The offering, designed in the
19: Kernighanian sense to utilize the existing 'egrep' when it must, also
20: makes use of Mr. Henry Spencer's regexp(3) functions in an unusual way.
21: For the edification of those without on-line access to system source code,
22: the vendor-supplied 'egrep' is left in a pristine state.
23:
24: With code now wending its way to mod.sources, we obtain the following
25: results. Times (in seconds) are all measured on a VAX 11/750 system running
26: BSD 4.2 on Fujitsu Eagles, although our 'egrep' has been run on the Sun 2,
27: V7 Unix/PDP 11, Vaxen configured with System V, and, for added effect, the
28: NASA Ames Cray 2.
29:
30: 200K bytes user sys notes
31:
32: (new) egrep astrian /usr/dict/words 0.4 0.5 implementation by "jaw"
33: match " " 0.5 0.5 VAX-only (Waterloo)
34: bm " " 1.1 0.6 Peter Bain's version 2
35: (old) egrep " " 5.6 1.7 standard
36:
37: [note: the output here is the single word "Zoroastrian".]
38:
39: Aha, you quip -- this is all very fine for the 99 and 44/100's percent
40: metacharacter-free world, but what about timing for shorter strings, character
41: folding, as well as for the more interesting universe of extended regular
42: expressions? Samples forthwith. (Egrep below refers to the new one, times for
43: the /usr/bin code being about the same as above on most any pattern.)
44:
45: egrep zurich 0.4 0.5 0 words output
46: egrep -i zuRich 0.4 0.5 1
47: egrep -i zeus 0.6 0.6 1
48: egrep -i zen 0.7 0.6 11
49: bm zen 2.2 0.6 10
50: egrep ZZ 0.8 0.6 0
51: bm ZZ 3.0 0.7 0
52: egrep -c Z 1.5 0.6 19
53: bm -c Z 5.9 0.7 19
54:
55: Admittedly, most people (or programs) don't search for single characters,
56: where Boyer-Moore is a bit slow, but it's important for the layered regular
57: expression approach described herein. We might point out from the above that
58: the popular "fold" option crippled by 'bm' costs little; it's only a slight
59: adjustment of the precomputed "delta" table as well as a single character
60: array reference in a secondary loop. Why has Bain claimed complexity for this?
61: Also, the times show that the inner loop chosen for our code (modeled after
62: the original speedup done by Boyer-Moore for the PDP 10) consistently betters
63: the "blindingly fast" version by a factor of two to three. The tipoff was
64: from previous paper studies (esp. Horspool, see header notes in code) noting
65: that the algorithm should, when implemented efficiently, best typical microcode.
66: Now it does.
67:
68: while ( (k += delta0 ( *k )) < strend )
69: ; /* over 80% of time spent here */
70:
71: is the key (modulo precomputation tricks), and takes but three or four
72: instructions on most machines.
73:
74: Basic method for regular expressions:
75:
76: (1) isolate the longest metacharacter-free pattern string via the
77: "regmust" field provided by H. Spencer's regcomp() routine.
78:
79: (Non-kosher, but worth not re-inventing the wheel.
80: v8 folks just might have to reverse-engineer Spencer's
81: reverse-engineering to provide equivalent functionality.
82: You see, there are many more sites running his code than v8.
83: Besides, we enjoy using regexpr technology on itself.
84:
85: (2) for "short" input, submatching lines are passed to regexec().
86:
87: (3) for "long" input, start up a standard 'egrep' process via
88: popen() or equivalent. Why not just use regexec()? Unfortunately
89: for our application, Spencer's otherwise admirable finite-state
90: automaton exhibits poor performance for complex expressions.
91: Setting a threshold on input length, though not perfect, helps.
92: If pipes on Unix were free, we'd use this way exclusively.
93: Until then, we buy happiness for those who might
94:
95: egrep stuff /usr/spool/news/net/unix/*
96:
97: or on other directories full of short files.
98:
99: [note: the details of (3) have changed in the re-release -- see pep4grep.doc3]
100:
101: So,
102: [new] egrep -i 'hoe.*g' words 1.2 1.1
103: {shoestring,Shoenberg}
104: [new] egrep '(a|b).*zz.*[od]$' words 1.5 1.1
105: {blizzard,buzzword,palazzo}
106: [old] egrep 6.3 1.4
107: but,
108: {new,old} egrep -c 'first|second' similar times (no isolate)
109:
110: Again, we stress that given the different nature of the simulations of the two
111: nondeterministic reg. expr. state-machines (one functionless), cases can be
112: "cooked" to show things in a bad light, so a hybrid is warranted.
113: We can generally do better incorporating the Boyer-Moore algorithm directly
114: into the AT&T code. For the last example, the abstraction
115:
116: (egrep first words &; egrep second words) | sort -u | wc
117:
118: ideally would work better on a parallel machine, but if you're expecting
119: something as amazing in this draft as, say, Morwen B. Thistlethwaite's 52-move
120: Rubik's Cube solution, you're in the wrong place.
121:
122: [note: but see pep4grep.doc3 -- now [ef]?grep handles some parallelism fast]
123:
124: About options -- the SVID ones are supported (-c, -l, bonus -i for BSD),
125: and -s and -h works as for BSD and v8. Note: the 'egrep' here just hands off
126: patterns to the old code for things like -n, -b, -v, and multiple patterns.
127: As a bone to throw to the enemies of the cat-v school, there is a -1 flag
128: (halt after printing first match), but we don't talk about it much.
129: Multiple patterns can done ala 'bm' but laziness in the presence of lack of
130: knowledge of where 'fgrep' wins has prevailed for version 1.
131:
132: Personally I feel that adapting ("internationalizing") the 'egrep' effort
133: for two-byte Kanji is FAR more important than tweeking options or tradeoffs,
134: so for you large-alphabet Boyer-Moore algorithm specialists, send ideas
135: this way.
136:
137: Further historical/philosophical comments follow in the sequel.
138:
139: James A. Woods (ames!jaw)
140: NASA Ames Research Center
141:
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.