|
|
1.1 root 1: Subject: Get Hep to Kanji-Ready Five-Algorithm [ef]?grep
2:
3: # "I need very little,
4: and of that little,
5: I need very little." -- St. Francis of Assisi
6:
7: Hybrid blitz 'egrep', whose urquell is a veritable chimaera of at least
8: five string search techniques, is now further tuned.
9:
10: Posted to USENET (and the mod.sources archive) some months ago, our
11: implementation of "plug-compatible" egrep.c has been extended to support:
12:
13: transparent 'grep' expression translation, to bring the speed of
14: hyper-'egrep' to bear upon searches specified via the less capable
15: 'grep' syntax.
16:
17: interception of 'fgrep' for alternations of low (<= 7) cardinality,
18: using a novel method of Boyer-Moore table superimposition and
19: pre-computation magic. the resulting speedup applies also to simple
20: metacharacter-free 'egrep'-style alternations.
21:
22: (the above two improvements are made invisible by linking the
23: grep/egrep/fgrep triumvirate.)
24:
25: a revised strategy of fallback to standard 'egrep' for hard
26: cases, which eliminates costly popen() plumbing in favor of a
27: statistically-based re-exec() dynamic.
28:
29: more complete application of fast match to standard options,
30: including -n line numbering.
31:
32: preparation for Kanji pattern input, based upon parity-marked EUC
33: codes. new egrep.c is "eight-bit clean". the fast algorithms
34: unfortunately currently apply only to meta-free patterns and
35: simple alternations; full Kanji regular expression treatment
36: remains problematic. however, the new code will pass difficult
37: input through to [ef]?grep utilities in the UNIX Japan standard
38: substrate.
39:
40: Kanji capability is perhaps the most important addition, as the
41: number of UNIX systems in the Orient proliferate, providing a "new market"
42: for Boyer-Moore-style search. However, actual search efficacy remains
43: unknown until the Gaijin author gains feedback from JUNET or points beyond.
44: For all we know, Nippon text search utilities may already incorporate
45: the methods. Tests conducted so far have been with artificial Kanji files.
46:
47: In case you are w(o|a)ndering, be reminded that no vendor source
48: changes are required to use the code. It is configured as a turbo-charged
49: "front-end" to the existing section one commands, though it is (barely)
50: conceivable to adapt things, at a loss in worst-case efficiency, for
51: (heaven forfend!) non-Unix systems running C. And, since we do not offer
52: a minimalist grep, do not expect it to run well on minimalist UNIX clones.
53:
54: Below appears a brief timing run on Webster's 2nd wordlist. Notes
55: on implementation changes since original release follow in the next message.
56: If you want to skip the rest, fine. The easy instructions -- download
57: from comp.sources [or 'anonymous' ftp of egrep.shar.Z from ames-aurora.arpa
58: for the (im)patient], and run:
59:
60: make
61: sh eg.shell # regression test amusement
62: make install
63:
64: after perusing README.FIRST. Though the bundle in ~ftp/pub at NASA Ames
65: Research Center contains prerequisite support from Univ. of Toronto's
66: Henry Spencer, we are not re-posting regcomp()/regexec() to comp.sources.
67: John Gilmore of the GNU project has a modified regexec(), but it is not
68: mandatory for running the new egrep.
69:
70: Contrary to popular belief, old egrep is not I/O bound for large
71: large files on most machines. The new version is. One sample:
72:
73: time egrep 'u.*nix' /usr/dict/web2 (2.5 MB)
74: (yielding {Coturnix, Pseudophoenix, [Tt]urnix}), shows
75:
76: user sys real (load ave. < 1)
77:
78: VAX 11/750, 4.3 BSD, Fujitsu Eagles
79: (new) 6.8 3.8 11
80: (old) 70.0 5.5 87
81:
82: Sun 3/160, 3.2 OS, Eagle on SMD
83: (new) 1.7 2.2 5
84: (old) 14.7 1.5 16
85:
86: Cray 2, Sys 5, no disk striping
87: (new) .93 .11 1
88: (old) 8.07 .21 8
89:
90: notes: New egrep was actually run with -i option, but not old egrep.
91: Also, fumbling for three-character residue is not [ef]?grep's forte,
92: so the example is conservative.
93:
94: Sun 3 has higher sys time for some unknown reason (a guess: VAX 4.3 kernel
95: handles read() calls with oddsize buffers differently?). Cray 2 reportedly
96: does disk I/O at 5-10 megabytes per second, but the architecture/compiler
97: is bad at byte addressing -- no cache, no vectors here. Unfair comparison:
98: new egrep on Sun beats old egrep on Cray 2, even with fast Cray I/O!
99:
100: Speculation: the code might be useful on the Macintosh II, even if the Unix
101: filesystem (Sys 5?) were to waste 3/4 of the 1 MB/sec SCSI disk bandwidth.
102: Mac 2 testers please forward info to ames!jaw.
103:
104: Another metric is inner loop efficiency:
105:
106: # instructions
107: VAX Berkeley cc 5
108: Sun 68020 3.2 cc 6
109: Stallman's GNU 68020 cc 4
110: MIPS R2000 6
111: Cray 2 25
112:
113: Thanks goes to mips!dce (David Elliott) for his testing time, as well as
114: to RMS for two-upping Sun's C compiler.
115:
116: Of course, if you have a Connection Machine dedicated to running their
117: admirable full-text keyworder on "in-core" text, you won't need [ef]?grep at
118: all. And, for unindexed text on fine-grained parallel machines, reg. expr.
119: search algorithms can be made to run with a lower time bound (see the Hillis
120: book). But if your files are on disk, even a specialized search chip won't help
121: once things become I/O or bus limited. For this reason, vectorization on a
122: Cray(ette) would be a bust, though Cray buffs may write the author for other
123: scalar speedup ideas...
124:
125: [continued]
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.