43BSDReno/usr.bin/grep/README/pep4grep.doc3 - annotate

Return to pep4grep.doc3 CVS log
Up to [CSRG BSD Unix] / 43BSDReno / usr.bin / grep / README
Annotation of 43BSDReno/usr.bin/grep/README/pep4grep.doc3, revision 1.1.1.1

1.1       root        1: Subject: Get Hep to Kanji-Ready Five-Algorithm [ef]?grep
                      2: 
                      3: #  "I need very little,
                      4:       and of that little,
                      5:          I need very little."  -- St. Francis of Assisi
                      6: 
                      7:      Hybrid blitz 'egrep', whose urquell is a veritable chimaera of at least
                      8: five string search techniques, is now further tuned.
                      9: 
                     10:      Posted to USENET (and the mod.sources archive) some months ago, our
                     11: implementation of "plug-compatible" egrep.c has been extended to support:
                     12: 
                     13:        transparent 'grep' expression translation, to bring the speed of
                     14:        hyper-'egrep' to bear upon searches specified via the less capable
                     15:        'grep' syntax.
                     16: 
                     17:        interception of 'fgrep' for alternations of low (<= 7) cardinality,
                     18:        using a novel method of Boyer-Moore table superimposition and
                     19:        pre-computation magic.  the resulting speedup applies also to simple
                     20:        metacharacter-free 'egrep'-style alternations.
                     21: 
                     22:        (the above two improvements are made invisible by linking the
                     23:        grep/egrep/fgrep triumvirate.)
                     24: 
                     25:        a revised strategy of fallback to standard 'egrep' for hard
                     26:        cases, which eliminates costly popen() plumbing in favor of a
                     27:        statistically-based re-exec() dynamic.
                     28: 
                     29:        more complete application of fast match to standard options,
                     30:        including -n line numbering.
                     31: 
                     32:        preparation for Kanji pattern input, based upon parity-marked EUC
                     33:        codes.  new egrep.c is "eight-bit clean".  the fast algorithms
                     34:        unfortunately currently apply only to meta-free patterns and
                     35:        simple alternations; full Kanji regular expression treatment
                     36:        remains problematic.  however, the new code will pass difficult
                     37:        input through to [ef]?grep utilities in the UNIX Japan standard
                     38:        substrate.
                     39: 
                     40:      Kanji capability is perhaps the most important addition, as the
                     41: number of UNIX systems in the Orient proliferate, providing a "new market"
                     42: for Boyer-Moore-style search.  However, actual search efficacy remains
                     43: unknown until the Gaijin author gains feedback from JUNET or points beyond.
                     44: For all we know, Nippon text search utilities may already incorporate
                     45: the methods.  Tests conducted so far have been with artificial Kanji files.
                     46: 
                     47:      In case you are w(o|a)ndering, be reminded that no vendor source
                     48: changes are required to use the code.  It is configured as a turbo-charged
                     49: "front-end" to the existing section one commands, though it is (barely)
                     50: conceivable to adapt things, at a loss in worst-case efficiency, for
                     51: (heaven forfend!) non-Unix systems running C.  And, since we do not offer
                     52: a minimalist grep, do not expect it to run well on minimalist UNIX clones.
                     53: 
                     54:      Below appears a brief timing run on Webster's 2nd wordlist.  Notes
                     55: on implementation changes since original release follow in the next message.
                     56: If you want to skip the rest, fine.  The easy instructions -- download
                     57: from comp.sources [or 'anonymous' ftp of egrep.shar.Z from ames-aurora.arpa
                     58: for the (im)patient], and run:
                     59: 
                     60:        make
                     61:        sh eg.shell     # regression test amusement
                     62:        make install
                     63: 
                     64: after perusing README.FIRST.  Though the bundle in ~ftp/pub at NASA Ames
                     65: Research Center contains prerequisite support from Univ. of Toronto's
                     66: Henry Spencer, we are not re-posting regcomp()/regexec() to comp.sources.
                     67: John Gilmore of the GNU project has a modified regexec(), but it is not
                     68: mandatory for running the new egrep.
                     69: 
                     70:      Contrary to popular belief, old egrep is not I/O bound for large
                     71: large files on most machines.  The new version is.  One sample:
                     72: 
                     73:        time egrep 'u.*nix' /usr/dict/web2      (2.5 MB)
                     74:          (yielding {Coturnix, Pseudophoenix, [Tt]urnix}), shows
                     75: 
                     76:                        user     sys      real   (load ave. < 1)
                     77: 
                     78:        VAX 11/750, 4.3 BSD, Fujitsu Eagles
                     79:           (new)        6.8      3.8       11
                     80:           (old)       70.0      5.5       87
                     81: 
                     82:        Sun 3/160, 3.2 OS, Eagle on SMD
                     83:           (new)        1.7      2.2        5
                     84:           (old)       14.7      1.5       16
                     85: 
                     86:        Cray 2, Sys 5, no disk striping
                     87:           (new)        .93      .11        1
                     88:           (old)       8.07      .21        8
                     89: 
                     90: notes:  New egrep was actually run with -i option, but not old egrep.
                     91: Also, fumbling for three-character residue is not [ef]?grep's forte,
                     92: so the example is conservative.
                     93: 
                     94: Sun 3 has higher sys time for some unknown reason (a guess:  VAX 4.3 kernel
                     95: handles read() calls with oddsize buffers differently?).  Cray 2 reportedly
                     96: does disk I/O at 5-10 megabytes per second, but the architecture/compiler
                     97: is bad at byte addressing -- no cache, no vectors here.  Unfair comparison:
                     98: new egrep on Sun beats old egrep on Cray 2, even with fast Cray I/O!
                     99: 
                    100: Speculation:  the code might be useful on the Macintosh II, even if the Unix
                    101: filesystem (Sys 5?) were to waste 3/4 of the 1 MB/sec SCSI disk bandwidth.
                    102: Mac 2 testers please forward info to ames!jaw.
                    103: 
                    104:      Another metric is inner loop efficiency:
                    105: 
                    106:                                # instructions
                    107:        VAX Berkeley cc                 5
                    108:        Sun 68020 3.2 cc                6
                    109:        Stallman's GNU 68020 cc         4
                    110:        MIPS R2000                      6
                    111:        Cray 2                         25
                    112: 
                    113: Thanks goes to mips!dce (David Elliott) for his testing time, as well as
                    114: to RMS for two-upping Sun's C compiler.
                    115: 
                    116:      Of course, if you have a Connection Machine dedicated to running their
                    117: admirable full-text keyworder on "in-core" text, you won't need [ef]?grep at
                    118: all.  And, for unindexed text on fine-grained parallel machines, reg. expr.
                    119: search algorithms can be made to run with a lower time bound (see the Hillis
                    120: book).  But if your files are on disk, even a specialized search chip won't help
                    121: once things become I/O or bus limited.  For this reason, vectorization on a
                    122: Cray(ette) would be a bust, though Cray buffs may write the author for other
                    123: scalar speedup ideas...
                    124: 
                    125: [continued]
unix.superglobalmegacorp.com
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.