.TH REGEXP 3 local
.DA 30 Nov 1985
.SH NAME
regcomp, regexec, regsub, regerror \- regular expression handler
.SH SYNOPSIS
.ft B
.nf
#include <regexp.h>

regexp *regcomp(exp)
char *exp;

int regexec(prog, string)
regexp *prog;
char *string;

regsub(prog, source, dest)
regexp *prog;
char *source;
char *dest;

regerror(msg)
char *msg;
.SH DESCRIPTION
These functions implement
.IR egrep (1)-style
regular expressions and supporting facilities.
.PP
.I Regcomp
compiles a regular expression into a structure of type
.IR regexp ,
and returns a pointer to it.
The space has been allocated using
.IR malloc (3)
and may be released by
.IR free .
.PP
.I Regexec
matches a NUL-terminated \fIstring\fR against the compiled regular expression
in \fIprog\fR.
It returns 1 for success and 0 for failure, and adjusts the contents of
\fIprog\fR's \fIstartp\fR and \fIendp\fR (see below) accordingly.
.PP
The members of a
.I regexp
structure include at least the following (not necessarily in order):
.PP
.RS
char *startp[NSUBEXP];
.br
char *endp[NSUBEXP];
.RE
.PP
where
.I NSUBEXP
is defined (as 10) in the header file.
Once a successful \fIregexec\fR has been done using the \fIregexp\fR,
each \fIstartp\fR-\fIendp\fR pair describes one substring
within the \fIstring\fR,
with the \fIstartp\fR pointing to the first character of the substring and
the \fIendp\fR pointing to the first character following the substring.
The 0th substring is the substring of \fIstring\fR that matched the whole
regular expression.
The others are those substrings that matched parenthesized expressions
within the regular expression, with parenthesized expressions numbered
in left-to-right order of their opening parentheses.
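.PP
As an illustration only, the following C sketch shows how a caller might
copy out the \fIn\fRth substring described by a \fIstartp\fR-\fIendp\fR pair.
The structure \fIdemo_regexp\fR and the function \fIcopy_match\fR are
hypothetical stand-ins that model only the two documented members;
the sketch also assumes that unused pairs hold NULL, which this page
does not promise.
.PP
.nf
```c
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10	/* value documented for the header file */

/* Hypothetical stand-in holding only the two documented members. */
struct demo_regexp {
	const char *startp[NSUBEXP];
	const char *endp[NSUBEXP];
};

/*
 * Copy substring n into buf (of size bufsize) as a NUL-terminated string.
 * Returns the substring length, or -1 if n is out of range, the pair is
 * unset (assumed to mean NULL), or buf is too small.
 */
static long copy_match(const struct demo_regexp *r, int n,
		       char *buf, size_t bufsize)
{
	size_t len;

	if (n < 0 || n >= NSUBEXP || r->startp[n] == NULL || r->endp[n] == NULL)
		return -1;
	/* endp points one past the last character of the substring. */
	len = (size_t)(r->endp[n] - r->startp[n]);
	if (len + 1 > bufsize)
		return -1;
	memcpy(buf, r->startp[n], len);
	buf[len] = '\0';
	return (long)len;
}
```
.fi
.PP
Because \fIendp\fR points just past the substring, its length is simply
the difference of the two pointers, and an empty substring has both
pointers equal.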
.PP
.I Regsub
copies \fIsource\fR to \fIdest\fR, making substitutions according to the
most recent \fIregexec\fR performed using \fIprog\fR.
Each instance of `&' in \fIsource\fR is replaced by the substring
indicated by \fIstartp\fR[\fI0\fR] and
\fIendp\fR[\fI0\fR].
Each instance of `\e\fIn\fR', where \fIn\fR is a digit, is replaced by
the substring indicated by
\fIstartp\fR[\fIn\fR] and
\fIendp\fR[\fIn\fR].
To get a literal `&' or `\e\fIn\fR' into \fIdest\fR, prefix it with `\e';
to get a literal `\e' preceding `&' or `\e\fIn\fR', prefix it with
another `\e'.
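.PP
The substitution rules above can be sketched in C as follows.
This is one reading of the rules, not the library's code:
the function \fIsketch_regsub\fR is hypothetical, it takes the
\fIstartp\fR and \fIendp\fR arrays directly rather than a \fIregexp\fR,
it assumes unused pairs are NULL, and it does no bounds checking on
\fIdest\fR.
.PP
.nf
```c
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10	/* value documented for the header file */

/*
 * Copy source to dest, expanding '&' and '\n' (n a digit) from the
 * startp/endp pairs.  dest must be large enough (assumption: the
 * caller guarantees this; the sketch does not check).
 */
static void sketch_regsub(const char *startp[], const char *endp[],
			  const char *source, char *dest)
{
	const char *src = source;
	char *dst = dest;
	int c, no;

	while ((c = *src++) != '\0') {
		if (c == '&')
			no = 0;			/* whole match */
		else if (c == '\\' && *src >= '0' && *src <= '9')
			no = *src++ - '0';	/* nth substring */
		else
			no = -1;		/* ordinary character */

		if (no < 0) {
			/* A backslash before '\' or '&' yields that character literally. */
			if (c == '\\' && (*src == '\\' || *src == '&'))
				c = *src++;
			*dst++ = (char)c;
		} else if (startp[no] != NULL && endp[no] != NULL) {
			size_t len = (size_t)(endp[no] - startp[no]);

			memcpy(dst, startp[no], len);
			dst += len;
		}
		/* An unset pair contributes nothing. */
	}
	*dst = '\0';
}
```
.fi
.PP
Under this reading, a backslash followed by any other character is
copied through unchanged.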
.PP
.I Regerror
is called whenever an error is detected in \fIregcomp\fR, \fIregexec\fR,
or \fIregsub\fR.
The default \fIregerror\fR writes the string \fImsg\fR,
with a suitable indicator of origin,
on the standard
error output
and invokes \fIexit\fR(2).
.I Regerror
can be replaced by the user if other actions are desirable.
.SH "REGULAR EXPRESSION SYNTAX"
A regular expression is zero or more \fIbranches\fR, separated by `|'.
It matches anything that matches one of the branches.
.PP
A branch is zero or more \fIpieces\fR, concatenated.
It matches a match for the first, followed by a match for the second, etc.
.PP
A piece is an \fIatom\fR possibly followed by `*', `+', or `?'.
An atom followed by `*' matches a sequence of 0 or more matches of the atom.
An atom followed by `+' matches a sequence of 1 or more matches of the atom.
An atom followed by `?' matches a match of the atom, or the null string.
.PP
An atom is a regular expression in parentheses (matching a match for the
regular expression), a \fIrange\fR (see below), `.'
(matching any single character), `^' (matching the null string at the
beginning of the input string), `$' (matching the null string at the
end of the input string), a `\e' followed by a single character (matching
that character), or a single character with no other significance
(matching that character).
.PP
A \fIrange\fR is a sequence of characters enclosed in `[]'.
It normally matches any single character from the sequence.
If the sequence begins with `^',
it matches any single character \fInot\fR from the rest of the sequence.
If two characters in the sequence are separated by `\-', this is shorthand
for the full list of ASCII characters between them
(e.g. `[0-9]' matches any decimal digit).
To include a literal `]' in the sequence, make it the first character
(following a possible `^').
To include a literal `\-', make it the first or last character.
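.PP
The range rules can be sketched as a small C function.
The name \fIrange_matches\fR is hypothetical and the code is only an
illustration of the rules as stated here, not the library's matcher;
it assumes a well-formed range and ASCII input.
.PP
.nf
```c
/*
 * Return 1 if character c matches the range that begins at r
 * (r points at the opening '['), else 0.
 * Assumes the range is well formed, i.e. has a closing ']'.
 */
static int range_matches(const char *r, int c)
{
	const char *p = r + 1;		/* skip '[' */
	const char *first;
	int negate = 0, found = 0;

	if (*p == '^') {		/* leading '^' inverts the sense */
		negate = 1;
		p++;
	}
	first = p;
	/* A ']' in first position is literal, so it does not end the loop. */
	while (*p != '\0' && (*p != ']' || p == first)) {
		if (p[1] == '-' && p[2] != ']' && p[2] != '\0') {
			/* x-y: every character from x through y */
			if ((unsigned char)p[0] <= c && c <= (unsigned char)p[2])
				found = 1;
			p += 3;
		} else {
			/* single character, including '-' placed first or last */
			if ((unsigned char)*p == c)
				found = 1;
			p++;
		}
	}
	return negate ? !found : found;
}
```
.fi
.PP
For instance, under these rules `[]a]' matches either `]' or `a',
and `[a-]' matches either `a' or `\-'.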
.SH AMBIGUITY
If a regular expression could match two different parts of the input string,
it will match the one which begins earliest.
If both begin in the same place but match different lengths, or match
the same length in different ways, life gets messier, as follows.
.PP
In general, the possibilities in a list of branches are considered in
left-to-right order, the possibilities for `*', `+', and `?' are
considered longest-first, nested constructs are considered from the
outermost in, and concatenated constructs are considered leftmost-first.
The match that will be chosen is the one that uses the earliest
possibility in the first choice that has to be made.
If there is more than one choice, the next will be made in the same manner
(earliest possibility) subject to the decision on the first choice.
And so forth.
.PP
For example, `(ab|a)b*c' could match `abc' in one of two ways.
The first choice is between `ab' and `a'; since `ab' is earlier, and does
lead to a successful overall match, it is chosen.
Since the `b' is already spoken for,
the `b*' must match its last possibility\(emthe empty string\(emsince
it must respect the earlier choice.
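.PP
The order of choices in this example can be made concrete with a
hand-unrolled C sketch for the single pattern `(ab|a)b*c', anchored at
the start of the string.
The function \fImatch_example\fR is hypothetical and is not how the
library is implemented; it only mirrors the stated order: branches
left to right, then `b*' longest first.
.PP
.nf
```c
#include <string.h>

/*
 * Try "(ab|a)b*c" at the start of s.  On success return 1 and store
 * the lengths matched by the parenthesized branch and by "b*";
 * otherwise return 0.
 */
static int match_example(const char *s, size_t *paren_len, size_t *star_len)
{
	static const char *const branches[] = { "ab", "a" };	/* left to right */
	size_t i;

	for (i = 0; i < 2; i++) {
		size_t blen = strlen(branches[i]);
		size_t max, k;

		if (strncmp(s, branches[i], blen) != 0)
			continue;	/* this branch cannot match here */
		/* Count the b's available to "b*". */
		for (max = 0; s[blen + max] == 'b'; max++)
			;
		/* Longest first, backing off one b at a time down to zero. */
		for (k = max + 1; k-- > 0; ) {
			if (s[blen + k] == 'c') {
				*paren_len = blen;
				*star_len = k;
				return 1;
			}
		}
	}
	return 0;
}
```
.fi
.PP
Given `abc', the sketch reports a branch length of 2 and a `b*' length
of 0, in line with the explanation above; given `ac', only the second
branch can succeed.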
.PP
In the particular case where no `|'s are present and there is only one
`*', `+', or `?', the net effect is that the longest possible
match will be chosen.
So `ab*', presented with `xabbbby', will match `abbbb'.
Note that if `ab*' is tried against `xabyabbbz', it
will match `ab' just after `x', due to the begins-earliest rule.
(In effect, the decision on where to start the match is the first choice
to be made, hence subsequent choices must respect it even if this leads them
to less-preferred alternatives.)
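.PP
The interaction of the begins-earliest rule with longest-first
repetition can be sketched in C for the single pattern `ab*'.
The function \fIfind_ab_star\fR is hypothetical and hard-codes that one
pattern; it is an illustration, not the library's search loop.
.PP
.nf
```c
#include <stddef.h>

/*
 * Search s for "ab*".  Return the offset of the match and store its
 * length in *len, or return -1 if there is no match.
 */
static long find_ab_star(const char *s, size_t *len)
{
	size_t start, n;

	/* The starting position is the first choice: earliest wins. */
	for (start = 0; s[start] != '\0'; start++) {
		if (s[start] != 'a')
			continue;
		/* Only then is "b*" extended as far as it will go. */
		for (n = 1; s[start + n] == 'b'; n++)
			;
		*len = n;
		return (long)start;
	}
	return -1;
}
```
.fi
.PP
The sketch never looks past the first `a', which is why the longer run
of b's later in `xabyabbbz' is not considered.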
.SH SEE ALSO
egrep(1), expr(1)
.SH DIAGNOSTICS
\fIRegcomp\fR returns NULL for a failure
(\fIregerror\fR permitting),
where failures are syntax errors, exceeding implementation limits,
or applying `+' or `*' to a possibly-null operand.
.SH HISTORY
Both code and manual page were
written at the University of Toronto.
They are intended to be compatible with the Bell V8 \fIregexp\fR(3),
but are not derived from Bell code.
.SH BUGS
Empty branches and empty regular expressions are not portable to V8.
.PP
The restriction against
applying `*' or `+' to a possibly-null operand is an artifact of the
simplistic implementation.
.PP
Does not support \fIegrep\fR's newline-separated branches;
neither does the V8 \fIregexp\fR(3), though.
.PP
Due to emphasis on
compactness and simplicity,
it's not strikingly fast.
It does give special attention to handling simple cases quickly.