Annotation of 43BSDTahoe/ucb/grep/README.kanji.mods, revision 1.1.1.1

1.1       root        1:      Three areas must be addressed to provide full Kanji compatibility.
                      2: Only #1 (for the non-regular expression case) has been implemented
                      3: directly in our grep/egrep-compatible Boyer-Moore-based code.
                      4: 
                      5:        (1) false middle match
                      6: 
                      7:           (a) meta-free Kanji
                      8:           (b) Kanji regexprs
                      9: 
                     10: Kanji 16-bit "EUC" data codes (see Jung/Kalash, "Yunikkusu wa Nihongo o
                     11: Hanasemasu", p. 209, Atlanta Usenix, 1986) have the upper bit on in both
                     12: bytes, so as to allow intermixing of ASCII while preserving end-of-string
                     13: detection.  'grep' must beware of matching a Kanji byte pair that
                     14: straddles two adjacent, unrelated Kanji characters.  e.g.
                     15: 
                     16:        text:           a (k1 k2) b (k3 k4) (k5 k6)
                     17:        pattern:                       (k4   k5)        
                     18: 
                     19: is a bad match, given ASCII bytes 'a' and 'b', and Kanji characters
                     20: (k1 k2), (k3 k4), and (k5 k6).  The solution for Kanji grep using
                     21: the traditional algorithm might be to anchor the pattern only at
                     22: Kanji pair boundaries while scanning forward.
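
A minimal sketch of such a forward-anchored scan, written in Python for
brevity (the grep code itself is C).  It assumes well-formed text in which
every high-bit byte belongs to a two-byte code; the function name and the
byte values standing in for k1..k6 are made up for illustration.

```python
# Hypothetical stand-ins: 'a' (k1 k2) 'b' (k3 k4) (k5 k6)
TEXT = b"a\xb1\xc1b\xb2\xc2\xb3\xc3"


def euc_find_forward(text: bytes, pattern: bytes) -> int:
    """Return the offset of the first character-aligned match, or -1."""
    i = 0
    n, m = len(text), len(pattern)
    while i + m <= n:
        if text[i:i + m] == pattern:
            return i
        # Step over one whole character: two bytes if the high bit is set,
        # otherwise a single ASCII byte.
        i += 2 if text[i] & 0x80 else 1
    return -1


# A plain byte search reports the straddling (k4 k5) pair; the anchored
# scan does not.
naive = TEXT.find(b"\xc2\xb3")
anchored = euc_find_forward(TEXT, b"\xc2\xb3")
```

Note that this loop inspects every character of the text.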
                     23: 
                     24: Boyer-Moore methods skip over text bytes and cannot afford this.  So we allow
                     25: false matches, then scan backwards for legality (the run of high-bit bytes back
                     26: to the nearest preceding ASCII byte must be even).  Another appealing method,
                     27: for "layered" processing via regexp(3), is to convert the meta-free
                     28: Kanji to '(^|[^\000-\177])k1k2', assuming Henry Spencer's code is
                     29: "8-bit clean".  Case (b) (e.g. regexprs like 'k1k2.*k3k4') is similar,
                     30: though syntax translation may be more difficult.
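
The backward legality scan for the meta-free case might look like the
following Python sketch.  It is not the code shipped in grep: `bytes.find`
stands in for the Boyer-Moore search, the function names are hypothetical,
and the text is assumed to hold only ASCII bytes and two-byte codes with the
upper bit on in both bytes.

```python
# Hypothetical stand-ins: 'a' (k1 k2) 'b' (k3 k4) (k5 k6)
TEXT = b"a\xb1\xc1b\xb2\xc2\xb3\xc3"


def is_aligned(text: bytes, pos: int) -> bool:
    """True if offset pos falls on a character boundary."""
    run = 0
    i = pos - 1
    # Count high-bit bytes back to the nearest ASCII byte (or start of text).
    while i >= 0 and text[i] & 0x80:
        run += 1
        i -= 1
    # An even run means only whole two-byte codes precede pos.
    return run % 2 == 0


def euc_find(text: bytes, pattern: bytes) -> int:
    """Fast search first, then reject candidates that split a character."""
    start = 0
    while True:
        pos = text.find(pattern, start)  # stand-in for Boyer-Moore
        if pos < 0 or is_aligned(text, pos):
            return pos
        start = pos + 1  # false middle match: keep looking


false_drop = euc_find(TEXT, b"\xc2\xb3")   # (k4 k5) straddles two characters
real_match = euc_find(TEXT, b"\xb3\xc3")   # (k5 k6) is a whole character
```

The backward scan runs only on candidate matches, so the skipping search
keeps its speed on text that contains few of them.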
                     31: 
                     32:        (2) closures
                     33: 
                     34:      Given 'k1k2*' [where the '*' may be '+' or '?'], eight-bit egrep would
                     35: wrongly apply the closure to the previous byte (k2) instead of the byte pair.
                     36: One solution (without touching the existing 'regexp(3)' or 'e?grep' source)
                     37: is to simply parenthesize reg exprs 'k1k2*' -> '(k1k2)*'.
                     38: [only works with egrep syntax, so should occur after the grep->egrep
                     39: expr xlation].
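
As an illustration of the parenthesizing step, here is a deliberately naive
Python sketch using the byte-oriented `re` module as a stand-in for an
eight-bit egrep.  The function name is hypothetical, and the rewrite ignores
backslash escapes and bracket expressions, which a real translator would
have to respect.

```python
import re

K = b"\xb1\xc1"  # hypothetical stand-in for the Kanji character (k1 k2)


def group_kanji_closures(pattern: bytes) -> bytes:
    """Rewrite k1k2* (or +, ?) as (k1k2)* so the closure covers the pair."""
    out = bytearray()
    i = 0
    while i < len(pattern):
        if pattern[i] & 0x80 and i + 1 < len(pattern):
            pair = pattern[i:i + 2]
            if pattern[i + 2:i + 3] in (b"*", b"+", b"?"):
                out += b"(" + pair + b")"
            else:
                out += pair
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return bytes(out)


raw = K + b"*"
fixed = group_kanji_closures(raw)

# Byte-wise, the raw closure binds to k2 alone, so two copies of the
# character are not matched; the grouped form matches them.
raw_hit = re.fullmatch(raw, K + K)
fixed_hit = re.fullmatch(fixed, K + K)
```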
                     40: 
                     41:        (3) character classes
                     42: 
                     43:           (a) easy case:  [k1k2k3k4k5k6]
                     44: 
                     45:                -- just map to (k1k2|k3k4|k5k6).
                     46: 
                     47:           (b) hard:  ranges [k1k2-k3k4]
                     48: 
                     49: Ranges fail for byte-oriented char class code.
                     50: Kanji interpretation (how do ideograms collate?) is also problematic.
                     51: Translation to egrep '.*((k1k2)|(k1k2++)...|(k3k4)).*', where '++'
                     52: denotes "16-bit successor" is conceivable, but farfetched.
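
Both translations can be sketched in Python under loud assumptions.  The
easy case is a straight rewrite.  For the range case the sketch simply takes
"16-bit successor" to mean the next code in raw numeric order with the
second byte running from 0xA1 to 0xFE, which sidesteps the collation
question rather than answering it.  All function names are hypothetical.

```python
def class_to_alternation(members: bytes) -> bytes:
    """[k1k2k3k4...] -> (k1k2|k3k4|...) for a class of two-byte codes only."""
    if len(members) % 2 or not all(b & 0x80 for b in members):
        raise ValueError("expected whole two-byte codes")
    pairs = [members[i:i + 2] for i in range(0, len(members), 2)]
    return b"(" + b"|".join(pairs) + b")"


def successor(code: bytes) -> bytes:
    """Assumed '++': next code in numeric order, second byte 0xA1..0xFE."""
    hi, lo = code
    return bytes([hi, lo + 1]) if lo < 0xFE else bytes([hi + 1, 0xA1])


def range_to_alternation(first: bytes, last: bytes) -> bytes:
    """[k1k2-k3k4] -> (k1k2|k1k2++|...|k3k4) by enumerating every code."""
    if first > last:
        raise ValueError("empty range")
    pairs = [first]
    while pairs[-1] != last:
        pairs.append(successor(pairs[-1]))
    return b"(" + b"|".join(pairs) + b")"


easy = class_to_alternation(b"\xb1\xc1\xb2\xc2\xb3\xc3")
hard = range_to_alternation(b"\xb0\xfd", b"\xb1\xa2")
```

Even this toy version shows why the range case is unattractive: a range
spanning a few rows of the code table expands into hundreds of alternatives.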
                     53: 
                     54:      Now, translations (1) and (2) may be done [messily] w/o touching
                     55: Spencer's code, while (3) could be farmed out to standard Kanji egrep via the
                     56: process exec mechanism already established (see pep4grep.doc[123]).
                     57: But if (3) were done this way (invoking exec()), then the other cases might
                     58: also be done without recourse to the above xlations [just match "regmust"
                     59: first, then pass false drops to the Japan Unix std.]  However, r.e.'s handled
                     60: in such a manner would make hybrid Boyer-Moore slow for small files, except for
                     61: systems running MACH.  We could have ad hoc file size vs. exec() tradeoff
                     62: detectors control things for Kanji (it's already done for Anglo exprs), but
                     63: previous success has hinged upon having the regexp(3) layer compatible with the
                     64: r.e. style of the coarser egrep utility.
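
The "match regmust first, then pass false drops on" layering can be pictured
with a short Python sketch.  Here `must` is a literal that every match is
assumed to contain, and `slow_match` is a hypothetical callable standing in
for the full r.e. machine (or an exec'ed Kanji egrep); neither name comes
from the grep source.

```python
import re

LINES = [b"foo bar\n", b"food baz\n", b"xfoo 12\n", b"nothing here\n"]
PATTERN = re.compile(rb"foo[a-z]* ba")  # stand-in for the slow matcher


def hybrid_grep(lines, must: bytes, slow_match):
    """Keep lines that pass a cheap literal test and then the full matcher."""
    hits = []
    for line in lines:
        # The literal test rejects most lines cheaply; only the survivors
        # (true matches plus false drops) reach the expensive matcher.
        if must in line and slow_match(line):
            hits.append(line)
    return hits


hits = hybrid_grep(LINES, b"foo", PATTERN.search)
```

In this example three of the four lines contain the literal, and the slow
matcher then discards the one false drop among them.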
                     65: 
                     66:      Thus we take the easy way out and make fast grep apply only to simple
                     67: non-r.e. Kanji.  The very best approach remains modification of proprietary
                     68: Kanji egrep to incorporate Boyer-Moore directly, by doing Boyer-Moore on the
                     69: buffers first before rescanning with the Kanji r.e. machine.  Someday.
                     70: 
                     71: -- James A. Woods (ames!jaw)
                     72: 
                     73: Postscript:  Several articles in the special issue of UNIX Review
                     74: (March 1987) have delineated the bewildering variety of codesets
                     75: (Shift JIS, HP 15/16, many EUC flavors, etc.).  A late addition to
                     76: [ef]?grep Kanji support is capability for intermixed Katakana (SS2).
                     77: Full testing on real Kanji files has not been done.  Comments are welcome.
