The following is a summary of the somewhat plausible ideas
suggested for the new grep. I thank leo de witt particularly and others
for clearing up misconceptions and pointing out (correctly) that
existing tools like sed already do (or at least nearly do) what some people
asked for. The following points are in no particular order and no slight is
intended by my presentation.

1) named character classes, e.g. \alpha, \digit.
i think this is a hokey idea and dismissed it as unnecessary crud
but then found out it is part of the proposed regular expression
stuff for posix. it may creep in but i hope not.

2) matching multi-line patterns (\n as part of pattern)
this actually requires a lot of infrastructure support and thought.
i prefer to leave that to other more powerful programs such as sam.

3) print lines with context.
the second most requested feature but i'm not doing it. this is
just the job for sed. to be consistent, we just took the context
crap out of diff too. this is actually reasonable; showing context
is the job for a separate tool (pipeline difficulties apart).

4) print one (first matching) line and go on to the next file.
most of the justification for this seemed to be scanning
mail and/or netnews articles for the subject line; neither
of which gets any sympathy from me. but it is easy to do
and doesn't add an option; we add a new option (say -1)
and remove -s. -1 is just like -s except it prints the matching line.
then the old grep -s pattern is now grep -1 pattern > /dev/null
and within epsilon of being as efficient.

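The -1 idea in point 4 can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not gre's code; the names first_match and silent_status are made up for the example, and the 0/1 return values assume the usual grep exit convention (0 for a match, 1 for none).

```python
import re
from typing import Iterable, Optional


def first_match(lines: Iterable[str], pattern: str) -> Optional[str]:
    """Return the first line that matches, or None; stop reading at the first hit."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            return line  # no need to look at the rest of the file
    return None


def silent_status(lines: Iterable[str], pattern: str) -> int:
    """Old 'grep -s' expressed as '-1' with the output thrown away."""
    return 0 if first_match(lines, pattern) is not None else 1
```

Because the scan stops at the first hit either way, discarding the printed line costs almost nothing, which is the "within epsilon" claim above.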
5) divert matching lines onto one fd, nonmatching onto another.
sorry, run grep twice.

6) print the Nth occurrence of the pattern (N is number or list).
it may be possible to think of a real reason for this (i couldn't)
but the answer is no.

7) -w (pattern matches only words)
the most requested feature. well, it turns out that -x (exact)
is there because doug mcilroy wanted to match words against a dictionary.
it seems to have no other use. Therefore, -x is being dropped
(after all, it only costs a quick edit to do it yourself) and is
replaced by -w == (^|[^_a-zA-Z0-9])pattern($|[^_a-zA-Z0-9]).

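To see what the -w rewrite in point 7 does, here is a small Python sketch that wraps a pattern exactly as written above. Python's re module stands in for the real matcher, word_pattern and matches_word are hypothetical names, and the extra (?:...) around the user's pattern is my own assumption so that an alternation like cat|dog stays inside the word boundaries.

```python
import re

WORD_CHARS = "_a-zA-Z0-9"  # the character set used in the -w definition


def word_pattern(pattern: str) -> str:
    """Wrap pattern as (^|[^word])pattern($|[^word])."""
    # (?:...) is an added assumption: it keeps 'a|b' from escaping the wrapper.
    return f"(^|[^{WORD_CHARS}])(?:{pattern})($|[^{WORD_CHARS}])"


def matches_word(pattern: str, line: str) -> bool:
    """True if pattern occurs in line delimited by non-word characters."""
    return re.search(word_pattern(pattern), line) is not None
```

Note that underscore counts as a word character here, so cat does not match inside cat_food but does match in cat-food.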
8) grep should work on binary files and kanji.
that it should work on kanji or any character set is a given
(at least, any character set supported by the system V international
character set stuff). binary files will work too modulo the
following constraint: lines (between \n's) have to fit in a
buffer (current size 64K). violations are an error (exit 2).

9) -b has bogus units.
agreed. -b now is in bytes.

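For point 9, a byte offset per line is simple to compute. The following Python sketch shows one way to do it, assuming lines are delimited by \n only; line_offsets and matching_offsets are illustrative names, not anything from gre.

```python
from typing import Iterator, List, Tuple


def line_offsets(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (byte offset of line start, line) for each \\n-delimited line."""
    offset = 0
    while offset < len(data):
        end = data.find(b"\n", offset)
        # A final line without a trailing newline runs to the end of the data.
        end = len(data) if end == -1 else end + 1
        yield offset, data[offset:end]
        offset = end


def matching_offsets(data: bytes, needle: bytes) -> List[int]:
    """Byte offsets of the lines that contain needle (a fixed string)."""
    return [off for off, line in line_offsets(data) if needle in line]
```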
10) -B (add an ^ to the front of the given pattern, analogous to -x and -w)
-x (and -w) is enough. sorry.

11) recursively descend through argument lists
no. find | xargs is going to have to do.

12) read filenames on standard input
no. xargs will have to do.

13) should be as fast as bm.
no worries. in fact, our egrep is 3x faster than bm. i intend to be
competitive with woods' egrep. it should also be as fast as fgrep for
multiple keywords. the new grep incorporates boyer-moore
as a degenerate case of Commentz-Walter, a faster replacement
for the fgrep algorithm.

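For readers who have not met boyer-moore (point 13), here is a sketch of the single-keyword case in Python. To keep it short it uses the simplified Horspool variant (bad-character shifts only) rather than full Boyer-Moore, and it says nothing about Commentz-Walter or about how gre actually implements either; horspool_find is a name invented for the example.

```python
def horspool_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # For each character in pattern[:-1], distance from its last occurrence
    # to the end of the pattern; characters not listed shift by the full length.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos + m <= n:
        # Compare right to left, as Boyer-Moore does.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos
        # Skip ahead based on the text character under the pattern's last slot.
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

The speed comes from the skip: when the character under the end of the pattern does not occur in the pattern at all, the search jumps a whole pattern length without looking at the characters in between.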
14) -lv (files that don't have any matching lines)
-lv means print names of files that have any nonmatching lines
(useful, say, for checking input syntax). -L will mean print
names of files without selected lines.

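The difference between -l, -lv and -L in point 14 is easy to get wrong, so here is a Python sketch of the three predicates as described above. The function names are hypothetical and re is only a stand-in for the matcher.

```python
import re
from typing import Sequence


def has_selected_line(lines: Sequence[str], pattern: str, invert: bool = False) -> bool:
    """True if any line is 'selected': matching, or nonmatching when invert (-v) is set."""
    regex = re.compile(pattern)
    return any((regex.search(line) is None) == invert for line in lines)


def flag_l(lines: Sequence[str], pattern: str) -> bool:
    """-l: the file has at least one matching line."""
    return has_selected_line(lines, pattern)


def flag_lv(lines: Sequence[str], pattern: str) -> bool:
    """-lv: the file has at least one nonmatching line."""
    return has_selected_line(lines, pattern, invert=True)


def flag_L(lines: Sequence[str], pattern: str) -> bool:
    """-L: the file has no selected (here: matching) line at all."""
    return not has_selected_line(lines, pattern)
```

A file where some lines match and some do not is reported by both -l and -lv, but not by -L.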
15) print the part of the line that matched.
no. that is available at the subroutine level.

16) compatibility with old grep/fgrep/egrep.
the current name for the new command is gre (aho chose it).
after a while, it will become our grep. there will be a -G
flag to take patterns a la old grep and a -F to take
patterns a la fgrep (that is, no metacharacters except \n == |).
gre is close enough to egrep to not matter.

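The -F rule in point 16 ("no metacharacters except \n == |") amounts to a small translation, sketched here in Python. The function name is made up, re.escape stands in for "treat everything literally", and dropping empty keywords is my assumption; the post does not say what an empty line in the pattern should mean.

```python
import re


def fixed_strings_to_regex(patterns: str) -> str:
    """Turn newline-separated fixed strings into one alternation of literals."""
    # Assumption: empty keywords are skipped rather than matching everything.
    keywords = [k for k in patterns.split("\n") if k]
    return "|".join(re.escape(k) for k in keywords)
```

For example, the two keywords a.b and c* become a pattern that matches only those literal strings, so axb and ccc are not selected.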
17) fewer limits.
so far, gre will have only one limit, a line length of 64K.
(NO, i am not supporting arbitrary length lines (yet)!)
we foresee no need for any other limit. for example, the
current gre acts like fgrep. it is 4 times faster than
fgrep and has no limits; we can gre -f /usr/dict/words
(72K words, 600KB).

18) recognise file types (ignore binaries, unpack packed files etc).
get real. go back to your macintosh or pyramid. gre will just grep
files, not understand them.

19) handle patterns occurring multiple times per line
this is ill-defined (how many times does aaaa occur in a line of 20 'a's?
in order of decreasing correctness, the answers are >=1, 17, 5).
For the cases people mentioned (words), pipe it thru
tr to put the words one per line.

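The three answers in point 19 correspond to three different ways of counting, which this Python sketch makes concrete. Python's re module is used purely as a convenient calculator; the variable names are mine.

```python
import re

line = "a" * 20

# ">=1": the only thing a line-oriented grep needs to know.
at_least_one = re.search("aaaa", line) is not None

# 17: every starting position counts, so occurrences may overlap.
# The lookahead matches at a position without consuming any characters.
overlapping = len(re.findall("(?=aaaa)", line))

# 5: each occurrence consumes its four characters before the search resumes.
non_overlapping = len(re.findall("aaaa", line))

print(at_least_one, overlapping, non_overlapping)
```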
20) why use \{\} instead of \(\)?
this is not yet resolved (mcilroy&ritchie vs aho&pike&me).
grouping is an orthogonal issue to subexpressions so why
use the same parentheses? the latest suggestion (by ritchie)
is to allow both \(\) and \{\} as grouping operators but
the \3 would only count one type (say \(\)). this would be much
better for complicated patterns with much grouping.

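The point about \3 counting only one type of parenthesis can be illustrated by analogy with Python's re module, which has the same split: (?:...) groups without being counted, while (...) groups and is counted for backreferences. This is only an analogy; it is not the \{\} syntax under discussion and says nothing about how gre will end up doing it.

```python
import re

# (?:ab)+ is grouping only; (c|d) is the one counted subexpression, so \1
# refers to it no matter how many grouping-only parentheses come before.
pattern = re.compile(r"(?:ab)+(c|d)\1")

print(pattern.groups)                          # number of counted groups
print(pattern.fullmatch("ababcc") is not None)
print(pattern.fullmatch("ababcd") is not None)
```

With much grouping in a pattern, the backreference numbers stay small and stable because only the counted parentheses take part in the numbering.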
21) subroutine versions of the pattern matching stuff.
in a deep sense, the new grep will have no pattern matching code in it.
all the pattern matching code will be in libc with a uniform
interface. the boyer-moore and commentz-walter routines have been
done. the other two are egrep and back-referencing egrep.
lastly, regexp will be reimplemented.

22) support a filename of - to mean standard input.
a unix with /dev/stdin is largely bogus but as a sop to the poor
barstards having to work on BSD, gre will support -
as stdin (at least for a while).

Thus, the current proposal is the following flags. it would take a GOOD
argument to change my mind on this list (unless it is to get rid of a flag).

-f file   pattern is (`cat file`)
-v        nonmatching lines are 'selected'
-i        ignore alphabetic case
-n        print line number
-c        print count of selected lines only
-l        print filenames which have a selected line
-L        print filenames which do not have a selected line
-b        print byte offset of line begin
-h        do not print filenames in front of matching lines
-H        always print filenames in front of matching lines
-w        pattern is (^|[^_a-zA-Z0-9])pattern($|[^_a-zA-Z0-9])
-1        print only first selected line per file
-e expr   use expr as the pattern

research!andrew