Annotation of researchv10no/cmd/gre/gre.reply, revision 1.1.1.1

1.1       root        1:        The following is a summary of the somewhat plausible ideas
                      2: suggested for the new grep. I thank leo de witt particularly and others
                      3: for clearing up misconceptions and pointing out (correctly) that
                      4: existing tools like sed already do (or at least nearly do) what some people
                      5: asked for. The following points are in no particular order and no slight is
                      6: intended by my presentation.
                      7: 
                      8: 1) named character classes, e.g. \alpha, \digit.
                      9:        i think this is a hokey idea and dismissed it as unnecessary crud
                     10:        but then found out it is part of the proposed regular expression
                     11:        stuff for posix. it may creep in but i hope not.
                     12: 
                     13: 2) matching multi-line patterns (\n as part of pattern)
                     14:        this actually requires a lot of infrastructure support and thought.
                     15:        i prefer to leave that to other more powerful programs such as sam.
                     16: 
                     17: 3) print lines with context.
                     18:        the second most requested feature but i'm not doing it. this is
                     19:        just the job for sed. to be consistent, we just took the context
                     20:        crap out of diff too. this is actually reasonable; showing context
                     21:        is the job for a separate tool (pipeline difficulties apart).
                     22: 
                     23: 4) print one (first matching) line and go onto the next file.
                     24:        most of the justification for this seemed to be scanning
                     25:        mail and/or netnews articles for the subject line; neither
                     26:        of which gets any sympathy from me. but it is easy to do
                     27:        and adds no net option; we add a new option (say -1)
                     28:        and remove -s. -1 is just like -s except it prints the matching line.
                     29:        then the old grep -s pattern is now grep -1 pattern > /dev/null
                     30:        and within epsilon of being as efficient.
                     31: 
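The -1 behaviour in point 4 can be sketched in Python as a rough illustration. The function names here are hypothetical and are not part of gre, and the 0/1 status values are an assumption based on the usual grep convention of exiting 0 when a line is selected and 1 when none is.

```python
import re
from typing import Iterable, Optional


def first_selected_line(pattern: str, lines: Iterable[str]) -> Optional[str]:
    """Return the first line that matches pattern, or None if no line does.

    Reading stops at the first hit, so the rest of the input is never scanned.
    """
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            return line
    return None


def silent_status(pattern: str, lines: Iterable[str]) -> int:
    """Mimic the old -s: print nothing, report only a status (assumed 0 = found, 1 = not found)."""
    return 0 if first_selected_line(pattern, lines) is not None else 1
```

Discarding the returned line is the analogue of redirecting the output of -1 to /dev/null: the work done is the same up to the first match.
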
                     32: 5) divert matching lines onto one fd, nonmatching onto another.
                     33:        sorry, run grep twice.
                     34: 
                     35: 6) print the Nth occurrence of the pattern (N is number or list).
                     36:        it may be possible to think of a real reason for this (i couldn't)
                     37:        but the answer is no.
                     38: 
                     39: 7) -w (pattern matches only words)
                     40:        the most requested feature. well, it turns out that -x (exact)
                     41:        is there because doug mcilroy wanted to match words against a dictionary.
                     42:        it seems to have no other use. Therefore, -x is being dropped
                     43:        (after all, it only costs a quick edit to do it yourself) and is
                     44:        replaced by -w == (^|[^_a-zA-Z0-9])pattern($|[^_a-zA-Z0-9]).
                     45: 
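As a minimal sketch, the -w rewrite in point 7 can be tried with Python's re module. The helper names are hypothetical. The non-capturing group around the user's pattern is an addition of this sketch (not stated in the text above), so that an alternation such as a|b stays inside the word delimiters.

```python
import re

# Characters treated as part of a word, taken from the expression in point 7.
WORD_CHARS = "_a-zA-Z0-9"


def word_pattern(pattern: str) -> re.Pattern:
    """Wrap pattern as (^|[^word])pattern($|[^word])."""
    return re.compile(
        "(^|[^" + WORD_CHARS + "])(?:" + pattern + ")($|[^" + WORD_CHARS + "])"
    )


def matches_word(pattern: str, line: str) -> bool:
    """True if pattern occurs in line delimited by non-word characters or line ends."""
    return word_pattern(pattern).search(line) is not None
```

For example, matches_word("cat", "the cat sat") is true, while matches_word("cat", "concatenate") is false because the match is surrounded by word characters.
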
                     46: 8) grep should work on binary files and kanji.
                     47:        that it should work on kanji or any character set is a given
                     48:        (at least, any character set supported by the system V international
                     49:        character set stuff). binary files will work too modulo the
                     50:        following restraint: lines (between \n's) have to fit in a
                     51:        buffer (current size 64K). violations are an error (exit 2).
                     52: 
                     53: 9) -b has bogus units.
                     54:        agreed. -b now is in bytes.
                     55: 
                     56: 10) -B (add an ^ to the front of the given pattern, analogous to -x and -w)
                     57:        -x (and -w) is enough. sorry.
                     58: 
                     59: 11) recursively descend through argument lists
                     60:        no. find | xargs is going to have to do.
                     61: 
                     62: 12) read filenames on standard input
                     63:        no. xargs will have to do.
                     64: 
                     65: 13) should be as fast as bm.
                     66:        no worries. in fact, our egrep is 3x faster than bm. i intend to be
                     67:        competitive with woods' egrep. it should also be as fast as fgrep for
                     68:        multiple keywords. the new grep incorporates boyer-moore
                     69:        as a degenerate case of Commentz-Walter, a faster replacement
                     70:        for the fgrep algorithm.
                     71: 
                     72: 14) -lv (files that don't have any matching lines)
                     73:        -lv means print names of files that have any nonmatching lines
                     74:        (useful, say, for checking input syntax). -L will mean print
                     75:        names of files without selected lines.
                     76: 
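The distinction drawn in point 14 is easy to blur, so here is a small Python sketch of the two tests applied to one file's lines. The function names are hypothetical, and each function only models the decision of whether a file's name would be printed.

```python
import re
from typing import Iterable


def has_nonmatching_line(pattern: str, lines: Iterable[str]) -> bool:
    """-lv: with -v the nonmatching lines are 'selected', so the file is
    listed as soon as any one line fails to match."""
    regex = re.compile(pattern)
    return any(regex.search(line) is None for line in lines)


def has_no_matching_line(pattern: str, lines: Iterable[str]) -> bool:
    """-L: the file is listed only if no line at all matches."""
    regex = re.compile(pattern)
    return not any(regex.search(line) for line in lines)
```

A file with lines "ok 1", "bad", "ok 2" checked against ^ok is listed by the first test (it has a nonmatching line) but not by the second (it also has matching lines).
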
                     77: 15) print the part of the line that matched.
                     78:        no. that is available at the subroutine level.
                     79: 
                     80: 16) compatibility with old grep/fgrep/egrep.
                     81:        the current name for the new command is gre (aho chose it).
                     82:        after a while, it will become our grep. there will be a -G
                     83:        flag to take patterns a la old grep and a -F to take
                     84:        patterns a la fgrep (that is, no metacharacters except \n == |).
                     85:        gre is close enough to egrep to not matter.
                     86: 
                     87: 17) fewer limits.
                     88:        so far, gre will have only one limit, a line length of 64K.
                     89:        (NO, i am not supporting arbitrary length lines (yet)!)
                     90:        we foresee no need for any other limit. for example, the
                     91:        current gre acts like fgrep. it is 4 times faster than
                     92:        fgrep and has no limits; we can gre -f /usr/dict/words
                     93:        (72K words, 600KB).
                     94: 
                     95: 18) recognise file types (ignore binaries, unpack packed files etc).
                     96:        get real. go back to your macintosh or pyramid. gre will just grep
                     97:        files, not understand them.
                     98: 
                     99: 19) handle patterns occurring multiple times per line
                    100:        this is ill-defined (how many times does aaaa occur in a line of 20 'a's?
                    101:        in order of decreasing correctness, the answers are >=1, 17, 5).
                    102:        For the cases people mentioned (words), pipe it thru
                    103:        tr to put the words one per line.
                    104: 
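The three answers in point 19 correspond to three ways of counting, which a short Python check makes concrete. The variable names are illustrative only; the lookahead is just one convenient way to count overlapping occurrences.

```python
import re

line = "a" * 20

# ">=1": a line-oriented grep only asks whether the line matches at all.
at_least_one = re.search("aaaa", line) is not None

# "17": overlapping occurrences, one for each start position 0..16.
overlapping = len(re.findall("(?=aaaa)", line))

# "5": non-overlapping occurrences, scanning left to right.
non_overlapping = len(re.findall("aaaa", line))

print(at_least_one, overlapping, non_overlapping)  # prints: True 17 5
```
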
                    105: 20) why use \{\} instead of \(\)?
                    106:        this is not yet resolved (mcilroy&ritchie vs aho&pike&me).
                    107:        grouping is an orthogonal issue to subexpressions so why
                    108:        use the same parentheses? the latest suggestion (by ritchie)
                    109:        is to allow both \(\) and \{\} as grouping operators but
                    110:        the \3 would only count one type (say \(\)). this would be much
                    111:        better for complicated patterns with much grouping.
                    112: 
                    113: 21) subroutine versions of the pattern matching stuff.
                    114:        in a deep sense, the new grep will have no pattern matching code in it.
                    115:        all the pattern matching code will be in libc with a uniform
                    116:        interface. the boyer-moore and commentz-walter routines have been
                    117:        done. the other two are egrep and back-referencing egrep.
                    118:        lastly, regexp will be reimplemented.
                    119: 
                    120: 22) support a filename of - to mean standard input.
                    121:        a unix with /dev/stdin is largely bogus but as a sop to the poor
                    122:        barstards having to work on BSD, gre will support -
                    123:        as stdin (at least for a while).
                    124: 
                    125: Thus, the current proposal is the following flags. it would take a GOOD
                    126: argument to change my mind on this list (unless it is to get rid of a flag).
                    127: 
                    128: -f file        pattern is (`cat file`)
                    129: -v     nonmatching lines are 'selected'
                    130: -i     ignore alphabetic case
                    131: -n     print line number
                    132: -c     print count of selected lines only
                    133: -l     print filenames which have a selected line
                    134: -L     print filenames which do not have a selected line
                    135: -b     print byte offset of line begin
                    136: -h     do not print filenames in front of matching lines
                    137: -H     always print filenames in front of matching lines
                    138: -w     pattern is (^|[^_a-zA-Z0-9])pattern($|[^_a-zA-Z0-9])
                    139: -1     print only first selected line per file
                    140: -e expr        use expr as the pattern
                    141: 
                    142: research!andrew
