|
|
1.1 ! root 1: >From postnews Tue Mar 18 18:05:22 1986 ! 2: Subject: More Pep for Boyer-Moore Grep (part 2 of 2) ! 3: Newsgroups: net.unix ! 4: ! 5: # "Gratiano speaks an infinite deal of nothing, more than any man in all ! 6: of Venice. His reasons are as two grains of wheat hid in two bushels of ! 7: chaff: you shall seek all day ere you find them, they are not worth ! 8: the search." -- Shakespeare, Merchant of Venice ! 9: ! 10: ... or, part 2, "Reach out and Boyer-Moore Egrep Someone" ! 11: ! 12: Maybe you never use 'grep'. Then ignore this. But if you do, why not ! 13: use the best algorithm? Serious addicts know that for unstructured yet ! 14: stable text, B-trees are used for speed, or something like Lesk's nifty ! 15: (and unavailable) 'grab' suite for inverted files are ways to go. Barring file ! 16: inversion daemons for netnews and other ephemera, we are limited to the ! 17: present improvements. ! 18: ! 19: Proper skeptics should question why a nearly I/O-bound program (but ! 20: not for any CPU with less than the power of several VAX MIPS, alas) should ! 21: be made more so. The question was posed in B & M's classic 1978 CACM ! 22: paper -- the answer then was to free up more CPU cycles for timesharing. ! 23: Now, our motivations are more mundane (we won't have desktop 5 MIP machines ! 24: for another year), but not only that, we've discovered that the Cray 2's ! 25: standard 'egrep' is also very anemic, performing 8-12 times as worse as ours ! 26: on simple patterns. For shame, especially since hearing of the rumor that ! 27: certain group theorists have a search application ready for testing. ! 28: Boyer-Moore could fill in until a Cray vectorizing C compiler shows up. ! 29: Sheer speed for machines whose filesystems are cached in memory is nice too. ! 30: ! 31: A quick-and-dirty rundown of the debts to which the new hybrid pays ! 32: now follows. ! 33: ! 34: Thompson, K. T. (CACM, November 1968): ! 35: Regular Expression Search Algorithm. As usual, obvious ! 36: once you understand it. The current 'egrep'. Still ! 37: useful as a base. Abstracted by Aho/Ullman as Algorithm ! 38: 9.1 in Design and Analysis of Computer Algorithms. ! 39: ! 40: Boyer/Moore: ! 41: Not quite pre-Unix. Oh well. Modern designers should ! 42: know better now, if they want their stuff to get out there. ! 43: By the way, I haven't used delta2 (or 1) since the O(mn) case ! 44: case doesn't come up too often. Sure Knuth stood on his head ! 45: to better the linearity, but his proof had a bug in it until ! 46: the 1980 SIAM J. Comput. retraction. Would you want to code ! 47: something that even Knuth trips up on? ! 48: ! 49: Now to assuage nagging feelings that geneticists might want ! 50: to search entire libraries of 9000-unit nucleotide protein ! 51: sequences for ((AGCA|TTGCA).*TGC)|AGCT)?T?A+ or some nonsense ! 52: which MIGHT be nonlinear, you would want delta2. So convince ! 53: someone to do the Galil/Apostolico/Giancarlo 2n comparison ! 54: worst case stuff. See egrep.c for reference. ! 55: ! 56: Gosper, W. (HAKMEM 1972): ! 57: Gosper didn't get around to the Thompson-like machine until ! 58: 1972 with HAKMEM. His PDP 10 code is nevertheless valiant. ! 59: He is also (barely) credited with conceiving the backwards ! 60: match idea independently. Where is he now? ! 61: ! 62: Morris/Pratt: ! 63: Nice guys, but for this purpose, has-beens. ! 64: Neat to see a hacker's triumph bury some theory. ! 65: ! 66: Horspool (Software Practice & Experience, 1980): ! 67: Now here's a Canadian after the heart of things ! 68: (perfect hashing, text compression, NP-complete ! 69: code generation probs., etc.) Did some Amdahl ! 70: timings to show that delta2 is not so hot. ! 71: Knows about Search For Least Frequent Character First, ! 72: which is useful for short patterns. ! 73: ! 74: {,e,f}grep man page: ! 75: The laughable bugnote "but we do not know a single algorithm ! 76: that spans a wide enough range of space-time tradeoffs" ! 77: certainly presumes that there is no such thing as switching ! 78: logic. How the 'grep' family got into a multiple-version ! 79: mess is probably a Guy Harris story; 'egrep' looks like the ! 80: winner, as its functionality is pretty much a superset of ! 81: the other two. The K & P teaser (p. 105) offers hope for ! 82: unification, but we see no difference with extant V8 code. ! 83: ! 84: "Not cited in the text" -- the sexy randomized Karp/Rabin string searcher ! 85: (Sedgewick, Algorithms, or Karp's Turing Award Lecture), and the ribald classic ! 86: Time Warps, String Edits, and Macromolecules -- The Theory and Practice ! 87: of Sequence Comparison (Kruskal & Sankoff). Inquire within. ! 88: Thanks for your patience, ! 89: ! 90: James A. Woods (ames!jaw) ! 91: NASA Ames Research Center ! 92: ! 93: P.S. ! 94: Current applications for Boyer-Moore code include modification of ! 95: 'fastfind' for true speed, as well as substring search for 'grab', both ! 96: benefiting from BM-style search thru incrementally-compressed files/indices. ! 97:
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.