Annotation of 43BSDTahoe/ucb/grep/pep4grep.doc3, revision 1.1

1.1     ! root        1: Subject: Get Hep to Kanji-Ready Five-Algorithm [ef]?grep
        !             2: 
        !             3: #  "I need very little,
        !             4:       and of that little,
        !             5:          I need very little."  -- St. Francis of Assisi
        !             6: 
        !             7:      Hybrid blitz 'egrep', whose urquell is a veritable chimaera of at least
        !             8: five string search techniques, is now further tuned.
        !             9: 
        !            10:      Posted to USENET (and the mod.sources archive) some months ago, our
        !            11: implementation of "plug-compatible" egrep.c has been extended to support:
        !            12: 
        !            13:        transparent 'grep' expression translation, to bring the speed of
        !            14:        hyper-'egrep' to bear upon searches specified via the less capable
        !            15:        'grep' syntax.
        !            16: 
        !            17:        interception of 'fgrep' for alternations of low (<= 7) cardinality,
        !            18:        using a novel method of Boyer-Moore table superimposition and
        !            19:        pre-computation magic.  the resulting speedup applies also to simple
        !            20:        metacharacter-free 'egrep'-style alternations.
        !            21: 
        !            22:        (the above two improvements are made invisible by linking the
        !            23:        grep/egrep/fgrep triumvirate.)
        !            24: 
        !            25:        a revised strategy of fallback to standard 'egrep' for hard
        !            26:        cases, which eliminates costly popen() plumbing in favor of a
        !            27:        statistically-based re-exec() dynamic.
        !            28: 
        !            29:        more complete application of fast match to standard options,
        !            30:        including -n line numbering.
        !            31: 
        !            32:        preparation for Kanji pattern input, based upon parity-marked EUC
        !            33:        codes.  new egrep.c is "eight-bit clean".  the fast algorithms
        !            34:        unfortunately currently apply only to meta-free patterns and
        !            35:        simple alternations; full Kanji regular expression treatment
        !            36:        remains problematic.  however, the new code will pass difficult
        !            37:        input through to [ef]?grep utilities in the UNIX Japan standard
        !            38:        substrate.
        !            39: 
        !            40:      Kanji capability is perhaps the most important addition, as the
        !            41: number of UNIX systems in the Orient proliferate, providing a "new market"
        !            42: for Boyer-Moore-style search.  However, actual search efficacy remains
        !            43: unknown until the Gaijin author gains feedback from JUNET or points beyond.
        !            44: For all we know, Nippon text search utilities may already incorporate
        !            45: the methods.  Tests conducted so far have been with artificial Kanji files.
        !            46: 
        !            47:      In case you are w(o|a)ndering, be reminded that no vendor source
        !            48: changes are required to use the code.  It is configured as a turbo-charged
        !            49: "front-end" to the existing section one commands, though it is (barely)
        !            50: conceivable to adapt things, at a loss in worst-case efficiency, for
        !            51: (heaven forfend!) non-Unix systems running C.  And, since we do not offer
        !            52: a minimalist grep, do not expect it to run well on minimalist UNIX clones.
        !            53: 
        !            54:      Below appears a brief timing run on Webster's 2nd wordlist.  Notes
        !            55: on implementation changes since original release follow in the next message.
        !            56: If you want to skip the rest, fine.  The easy instructions -- download
        !            57: from comp.sources [or 'anonymous' ftp of egrep.shar.Z from ames-aurora.arpa
        !            58: for the (im)patient], and run:
        !            59: 
        !            60:        make
        !            61:        sh eg.shell     # regression test amusement
        !            62:        make install
        !            63: 
        !            64: after perusing README.FIRST.  Though the bundle in ~ftp/pub at NASA Ames
        !            65: Research Center contains prerequisite support from Univ. of Toronto's
        !            66: Henry Spencer, we are not re-posting regcomp()/regexec() to comp.sources.
        !            67: John Gilmore of the GNU project has a modified regexec(), but it is not
        !            68: mandatory for running the new egrep.
        !            69: 
        !            70:      Contrary to popular belief, old egrep is not I/O bound for large
        !            71: large files on most machines.  The new version is.  One sample:
        !            72: 
        !            73:        time egrep 'u.*nix' /usr/dict/web2      (2.5 MB)
        !            74:          (yielding {Coturnix, Pseudophoenix, [Tt]urnix}), shows
        !            75: 
        !            76:                        user     sys      real   (load ave. < 1)
        !            77: 
        !            78:        VAX 11/750, 4.3 BSD, Fujitsu Eagles
        !            79:           (new)        6.8      3.8       11
        !            80:           (old)       70.0      5.5       87
        !            81: 
        !            82:        Sun 3/160, 3.2 OS, Eagle on SMD
        !            83:           (new)        1.7      2.2        5
        !            84:           (old)       14.7      1.5       16
        !            85: 
        !            86:        Cray 2, Sys 5, no disk striping
        !            87:           (new)        .93      .11        1
        !            88:           (old)       8.07      .21        8
        !            89: 
        !            90: notes:  New egrep was actually run with -i option, but not old egrep.
        !            91: Also, fumbling for three-character residue is not [ef]?grep's forte,
        !            92: so the example is conservative.
        !            93: 
        !            94: Sun 3 has higher sys time for some unknown reason (a guess:  VAX 4.3 kernel
        !            95: handles read() calls with oddsize buffers differently?).  Cray 2 reportedly
        !            96: does disk I/O at 5-10 megabytes per second, but the architecture/compiler
        !            97: is bad at byte addressing -- no cache, no vectors here.  Unfair comparison:
        !            98: new egrep on Sun beats old egrep on Cray 2, even with fast Cray I/O!
        !            99: 
        !           100: Speculation:  the code might be useful on the Macintosh II, even if the Unix
        !           101: filesystem (Sys 5?) were to waste 3/4 of the 1 MB/sec SCSI disk bandwidth.
        !           102: Mac 2 testers please forward info to ames!jaw.
        !           103: 
        !           104:      Another metric is inner loop efficiency:
        !           105: 
        !           106:                                # instructions
        !           107:        VAX Berkeley cc                 5
        !           108:        Sun 68020 3.2 cc                6
        !           109:        Stallman's GNU 68020 cc         4
        !           110:        MIPS R2000                      6
        !           111:        Cray 2                         25
        !           112: 
        !           113: Thanks goes to mips!dce (David Elliott) for his testing time, as well as
        !           114: to RMS for two-upping Sun's C compiler.
        !           115: 
        !           116:      Of course, if you have a Connection Machine dedicated to running their
        !           117: admirable full-text keyworder on "in-core" text, you won't need [ef]?grep at
        !           118: all.  And, for unindexed text on fine-grained parallel machines, reg. expr.
        !           119: search algorithms can be made to run with a lower time bound (see the Hillis
        !           120: book).  But if your files are on disk, even a specialized search chip won't help
        !           121: once things become I/O or bus limited.  For this reason, vectorization on a
        !           122: Cray(ette) would be a bust, though Cray buffs may write the author for other
        !           123: scalar speedup ideas...
        !           124: 
        !           125: [continued]

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.