|
|
1.1 ! root 1: Subject: Get Hep to Kanji-Ready Five-Algorithm [ef]?grep ! 2: ! 3: # "I need very little, ! 4: and of that little, ! 5: I need very little." -- St. Francis of Assisi ! 6: ! 7: Hybrid blitz 'egrep', whose urquell is a veritable chimaera of at least ! 8: five string search techniques, is now further tuned. ! 9: ! 10: Posted to USENET (and the mod.sources archive) some months ago, our ! 11: implementation of "plug-compatible" egrep.c has been extended to support: ! 12: ! 13: transparent 'grep' expression translation, to bring the speed of ! 14: hyper-'egrep' to bear upon searches specified via the less capable ! 15: 'grep' syntax. ! 16: ! 17: interception of 'fgrep' for alternations of low (<= 7) cardinality, ! 18: using a novel method of Boyer-Moore table superimposition and ! 19: pre-computation magic. the resulting speedup applies also to simple ! 20: metacharacter-free 'egrep'-style alternations. ! 21: ! 22: (the above two improvements are made invisible by linking the ! 23: grep/egrep/fgrep triumvirate.) ! 24: ! 25: a revised strategy of fallback to standard 'egrep' for hard ! 26: cases, which eliminates costly popen() plumbing in favor of a ! 27: statistically-based re-exec() dynamic. ! 28: ! 29: more complete application of fast match to standard options, ! 30: including -n line numbering. ! 31: ! 32: preparation for Kanji pattern input, based upon parity-marked EUC ! 33: codes. new egrep.c is "eight-bit clean". the fast algorithms ! 34: unfortunately currently apply only to meta-free patterns and ! 35: simple alternations; full Kanji regular expression treatment ! 36: remains problematic. however, the new code will pass difficult ! 37: input through to [ef]?grep utilities in the UNIX Japan standard ! 38: substrate. ! 39: ! 40: Kanji capability is perhaps the most important addition, as the ! 41: number of UNIX systems in the Orient proliferate, providing a "new market" ! 42: for Boyer-Moore-style search. However, actual search efficacy remains ! 43: unknown until the Gaijin author gains feedback from JUNET or points beyond. ! 44: For all we know, Nippon text search utilities may already incorporate ! 45: the methods. Tests conducted so far have been with artificial Kanji files. ! 46: ! 47: In case you are w(o|a)ndering, be reminded that no vendor source ! 48: changes are required to use the code. It is configured as a turbo-charged ! 49: "front-end" to the existing section one commands, though it is (barely) ! 50: conceivable to adapt things, at a loss in worst-case efficiency, for ! 51: (heaven forfend!) non-Unix systems running C. And, since we do not offer ! 52: a minimalist grep, do not expect it to run well on minimalist UNIX clones. ! 53: ! 54: Below appears a brief timing run on Webster's 2nd wordlist. Notes ! 55: on implementation changes since original release follow in the next message. ! 56: If you want to skip the rest, fine. The easy instructions -- download ! 57: from comp.sources [or 'anonymous' ftp of egrep.shar.Z from ames-aurora.arpa ! 58: for the (im)patient], and run: ! 59: ! 60: make ! 61: sh eg.shell # regression test amusement ! 62: make install ! 63: ! 64: after perusing README.FIRST. Though the bundle in ~ftp/pub at NASA Ames ! 65: Research Center contains prerequisite support from Univ. of Toronto's ! 66: Henry Spencer, we are not re-posting regcomp()/regexec() to comp.sources. ! 67: John Gilmore of the GNU project has a modified regexec(), but it is not ! 68: mandatory for running the new egrep. ! 69: ! 70: Contrary to popular belief, old egrep is not I/O bound for large ! 71: large files on most machines. The new version is. One sample: ! 72: ! 73: time egrep 'u.*nix' /usr/dict/web2 (2.5 MB) ! 74: (yielding {Coturnix, Pseudophoenix, [Tt]urnix}), shows ! 75: ! 76: user sys real (load ave. < 1) ! 77: ! 78: VAX 11/750, 4.3 BSD, Fujitsu Eagles ! 79: (new) 6.8 3.8 11 ! 80: (old) 70.0 5.5 87 ! 81: ! 82: Sun 3/160, 3.2 OS, Eagle on SMD ! 83: (new) 1.7 2.2 5 ! 84: (old) 14.7 1.5 16 ! 85: ! 86: Cray 2, Sys 5, no disk striping ! 87: (new) .93 .11 1 ! 88: (old) 8.07 .21 8 ! 89: ! 90: notes: New egrep was actually run with -i option, but not old egrep. ! 91: Also, fumbling for three-character residue is not [ef]?grep's forte, ! 92: so the example is conservative. ! 93: ! 94: Sun 3 has higher sys time for some unknown reason (a guess: VAX 4.3 kernel ! 95: handles read() calls with oddsize buffers differently?). Cray 2 reportedly ! 96: does disk I/O at 5-10 megabytes per second, but the architecture/compiler ! 97: is bad at byte addressing -- no cache, no vectors here. Unfair comparison: ! 98: new egrep on Sun beats old egrep on Cray 2, even with fast Cray I/O! ! 99: ! 100: Speculation: the code might be useful on the Macintosh II, even if the Unix ! 101: filesystem (Sys 5?) were to waste 3/4 of the 1 MB/sec SCSI disk bandwidth. ! 102: Mac 2 testers please forward info to ames!jaw. ! 103: ! 104: Another metric is inner loop efficiency: ! 105: ! 106: # instructions ! 107: VAX Berkeley cc 5 ! 108: Sun 68020 3.2 cc 6 ! 109: Stallman's GNU 68020 cc 4 ! 110: MIPS R2000 6 ! 111: Cray 2 25 ! 112: ! 113: Thanks goes to mips!dce (David Elliott) for his testing time, as well as ! 114: to RMS for two-upping Sun's C compiler. ! 115: ! 116: Of course, if you have a Connection Machine dedicated to running their ! 117: admirable full-text keyworder on "in-core" text, you won't need [ef]?grep at ! 118: all. And, for unindexed text on fine-grained parallel machines, reg. expr. ! 119: search algorithms can be made to run with a lower time bound (see the Hillis ! 120: book). But if your files are on disk, even a specialized search chip won't help ! 121: once things become I/O or bus limited. For this reason, vectorization on a ! 122: Cray(ette) would be a bust, though Cray buffs may write the author for other ! 123: scalar speedup ideas... ! 124: ! 125: [continued]
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.