|
|
1.1 root 1: >From postnews Tue Mar 18 18:05:22 1986
2: Subject: More Pep for Boyer-Moore Grep (part 2 of 2)
3: Newsgroups: net.unix
4:
5: # "Gratiano speaks an infinite deal of nothing, more than any man in all
6: of Venice. His reasons are as two grains of wheat hid in two bushels of
7: chaff: you shall seek all day ere you find them, they are not worth
8: the search." -- Shakespeare, Merchant of Venice
9:
10: ... or, part 2, "Reach out and Boyer-Moore Egrep Someone"
11:
12: Maybe you never use 'grep'. Then ignore this. But if you do, why not
13: use the best algorithm? Serious addicts know that for unstructured yet
14: stable text, B-trees are used for speed, or something like Lesk's nifty
15: (and unavailable) 'grab' suite for inverted files are ways to go. Barring file
16: inversion daemons for netnews and other ephemera, we are limited to the
17: present improvements.
18:
19: Proper skeptics should question why a nearly I/O-bound program (but
20: not for any CPU with less than the power of several VAX MIPS, alas) should
21: be made more so. The question was posed in B & M's classic 1978 CACM
22: paper -- the answer then was to free up more CPU cycles for timesharing.
23: Now, our motivations are more mundane (we won't have desktop 5 MIP machines
24: for another year), but not only that, we've discovered that the Cray 2's
25: standard 'egrep' is also very anemic, performing 8-12 times as worse as ours
26: on simple patterns. For shame, especially since hearing of the rumor that
27: certain group theorists have a search application ready for testing.
28: Boyer-Moore could fill in until a Cray vectorizing C compiler shows up.
29: Sheer speed for machines whose filesystems are cached in memory is nice too.
30:
31: A quick-and-dirty rundown of the debts to which the new hybrid pays
32: now follows.
33:
34: Thompson, K. T. (CACM, November 1968):
35: Regular Expression Search Algorithm. As usual, obvious
36: once you understand it. The current 'egrep'. Still
37: useful as a base. Abstracted by Aho/Ullman as Algorithm
38: 9.1 in Design and Analysis of Computer Algorithms.
39:
40: Boyer/Moore:
41: Not quite pre-Unix. Oh well. Modern designers should
42: know better now, if they want their stuff to get out there.
43: By the way, I haven't used delta2 (or 1) since the O(mn) case
44: case doesn't come up too often. Sure Knuth stood on his head
45: to better the linearity, but his proof had a bug in it until
46: the 1980 SIAM J. Comput. retraction. Would you want to code
47: something that even Knuth trips up on?
48:
49: Now to assuage nagging feelings that geneticists might want
50: to search entire libraries of 9000-unit nucleotide protein
51: sequences for ((AGCA|TTGCA).*TGC)|AGCT)?T?A+ or some nonsense
52: which MIGHT be nonlinear, you would want delta2. So convince
53: someone to do the Galil/Apostolico/Giancarlo 2n comparison
54: worst case stuff. See egrep.c for reference.
55:
56: Gosper, W. (HAKMEM 1972):
57: Gosper didn't get around to the Thompson-like machine until
58: 1972 with HAKMEM. His PDP 10 code is nevertheless valiant.
59: He is also (barely) credited with conceiving the backwards
60: match idea independently. Where is he now?
61:
62: Morris/Pratt:
63: Nice guys, but for this purpose, has-beens.
64: Neat to see a hacker's triumph bury some theory.
65:
66: Horspool (Software Practice & Experience, 1980):
67: Now here's a Canadian after the heart of things
68: (perfect hashing, text compression, NP-complete
69: code generation probs., etc.) Did some Amdahl
70: timings to show that delta2 is not so hot.
71: Knows about Search For Least Frequent Character First,
72: which is useful for short patterns.
73:
74: {,e,f}grep man page:
75: The laughable bugnote "but we do not know a single algorithm
76: that spans a wide enough range of space-time tradeoffs"
77: certainly presumes that there is no such thing as switching
78: logic. How the 'grep' family got into a multiple-version
79: mess is probably a Guy Harris story; 'egrep' looks like the
80: winner, as its functionality is pretty much a superset of
81: the other two. The K & P teaser (p. 105) offers hope for
82: unification, but we see no difference with extant V8 code.
83:
84: "Not cited in the text" -- the sexy randomized Karp/Rabin string searcher
85: (Sedgewick, Algorithms, or Karp's Turing Award Lecture), and the ribald classic
86: Time Warps, String Edits, and Macromolecules -- The Theory and Practice
87: of Sequence Comparison (Kruskal & Sankoff). Inquire within.
88: Thanks for your patience,
89:
90: James A. Woods (ames!jaw)
91: NASA Ames Research Center
92:
93: P.S.
94: Current applications for Boyer-Moore code include modification of
95: 'fastfind' for true speed, as well as substring search for 'grab', both
96: benefiting from BM-style search thru incrementally-compressed files/indices.
97:
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.