Annotation of 43BSDReno/lib/libc/gen/regexp.3, revision 1.1.1.1

1.1       root        1: .\"
                      2: .\"    @(#)regexp.3    5.1 (Berkeley) 5/19/88
                      3: .\"
                      4: .TH REGEXP 3 "May 19, 1988"
                      5: .UC
                      6: .SH NAME
                      7: regcomp, regexec, regsub, regerror \- regular expression handlers
                      8: .SH SYNOPSIS
                      9: .nf
                     10: .B #include <regexp.h>
                     11: .PP
                     12: .B regexp *regcomp(exp)
                     13: .B char *exp;
                     14: .PP
                     15: .B int regexec(prog, string)
                     16: .B regexp *prog;
                     17: .B char *string;
                     18: .PP
                     19: .B regsub(prog, source, dest)
                     20: .B regexp *prog;
                     21: .B char *source;
                     22: .B char *dest;
                     23: .PP
                     24: .B regerror(msg)
                     25: .B char *msg;
                     26: .fi
                     27: .SH DESCRIPTION
                     28: \fIRegcomp\fP, \fIregexec\fP, \fIregsub\fP, and \fIregerror\fP implement
                     29: .IR egrep (1)-style
                     30: regular expressions and supporting facilities.
                     31: .PP
                     32: .I Regcomp
                     33: compiles a regular expression into a structure of type
                     34: .IR regexp ,
                     35: and returns a pointer to it.
                     36: The space has been allocated using
                     37: .IR malloc (3)
                     38: and may be released by
                     39: .IR free .
                     40: .PP
                     41: .I Regexec
                     42: matches a NUL-terminated \fIstring\fR against the compiled regular expression
                     43: in \fIprog\fR.
                     44: It returns 1 for success and 0 for failure, and adjusts the contents of
                     45: \fIprog\fR's \fIstartp\fR and \fIendp\fR (see below) accordingly.
                     46: .PP
                     47: The members of a
                     48: .I regexp
                     49: structure include at least the following (not necessarily in order):
                     50: .PP
                     51: .RS
                     52: char *startp[NSUBEXP];
                     53: .br
                     54: char *endp[NSUBEXP];
                     55: .RE
                     56: .PP
                     57: where
                     58: .I NSUBEXP
                     59: is defined (as 10) in the header file.
                     60: Once a successful \fIregexec\fR has been done using the \fIregexp\fR,
                     61: each \fIstartp\fR-\fIendp\fR pair describes one substring
                     62: within the \fIstring\fR,
                     63: with the \fIstartp\fR pointing to the first character of the substring and
                     64: the \fIendp\fR pointing to the first character following the substring.
                     65: The 0th substring is the substring of \fIstring\fR that matched the whole
                     66: regular expression.
                     67: The others are those substrings that matched parenthesized expressions
                     68: within the regular expression, with parenthesized expressions numbered
                     69: in left-to-right order of their opening parentheses.
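As an illustration of how a startp/endp pair delimits a substring, here is a minimal sketch in C. It does not use the library itself: struct match_spans and copy_span are hypothetical names that mirror only the two documented members, and treating an unused pair as NULL is an assumption, not something this page states.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10  /* the value this page gives for the header's constant */

/* Hypothetical stand-in holding only the two documented members. */
struct match_spans {
    const char *startp[NSUBEXP];
    const char *endp[NSUBEXP];
};

/*
 * Copy substring n into buf (capacity bufsize) and NUL-terminate it.
 * Returns the substring length, or -1 if n is out of range, the pair
 * is unset (assumed to be NULL), or buf is too small.
 */
long copy_span(const struct match_spans *m, int n, char *buf, size_t bufsize)
{
    size_t len;

    assert(m != NULL && buf != NULL);
    if (n < 0 || n >= NSUBEXP)
        return -1;
    if (m->startp[n] == NULL || m->endp[n] == NULL)
        return -1;

    /* endp points one past the last matched character. */
    len = (size_t)(m->endp[n] - m->startp[n]);
    if (len >= bufsize)
        return -1;

    memcpy(buf, m->startp[n], len);
    buf[len] = '\0';
    return (long)len;
}
```

Because endp points just past the substring, the length is simply endp[n] - startp[n], and an empty match has startp[n] equal to endp[n].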
                     70: .PP
                     71: .I Regsub
                     72: copies \fIsource\fR to \fIdest\fR, making substitutions according to the
                     73: most recent \fIregexec\fR performed using \fIprog\fR.
                     74: Each instance of `&' in \fIsource\fR is replaced by the substring
                     75: indicated by \fIstartp\fR[\fI0\fR] and
                     76: \fIendp\fR[\fI0\fR].
                     77: Each instance of `\e\fIn\fR', where \fIn\fR is a digit, is replaced by
                     78: the substring indicated by
                     79: \fIstartp\fR[\fIn\fR] and
                     80: \fIendp\fR[\fIn\fR].
                     81: To get a literal `&' or `\e\fIn\fR' into \fIdest\fR, prefix it with `\e';
                     82: to get a literal `\e' preceding `&' or `\e\fIn\fR', prefix it with
                     83: another `\e'.
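The substitution rules above can be sketched in C as follows. This is an illustrative reimplementation of the documented behaviour, not the library's own regsub: struct spans and expand_sketch are hypothetical names, the explicit destsize bound is an addition (the documented regsub takes no size argument), and skipping pairs that are NULL is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10  /* the value this page gives for the header's constant */

/* Hypothetical stand-in for the two documented regexp members. */
struct spans {
    const char *startp[NSUBEXP];
    const char *endp[NSUBEXP];
};

/*
 * Expand source into dest (capacity destsize) following the rules above.
 * Returns 0 on success, -1 if dest is too small.
 */
int expand_sketch(const struct spans *m, const char *source,
                  char *dest, size_t destsize)
{
    const char *src = source;
    size_t used = 0;

    assert(m != NULL && source != NULL && dest != NULL);

    while (*src != '\0') {
        char c = *src++;
        int n = -1;                     /* -1 means "ordinary character" */

        if (c == '&')
            n = 0;                      /* whole match */
        else if (c == '\\' && *src >= '0' && *src <= '9')
            n = *src++ - '0';           /* \n: nth parenthesized match */

        if (n < 0) {
            /* A backslash before '\' or '&' makes that character literal. */
            if (c == '\\' && (*src == '\\' || *src == '&'))
                c = *src++;
            if (used + 1 >= destsize)
                return -1;
            dest[used++] = c;
        } else if (m->startp[n] != NULL && m->endp[n] != NULL) {
            size_t len = (size_t)(m->endp[n] - m->startp[n]);
            if (used + len >= destsize)
                return -1;
            memcpy(dest + used, m->startp[n], len);
            used += len;
        }
    }

    if (used >= destsize)
        return -1;
    dest[used] = '\0';
    return 0;
}
```

With pair 0 covering `abbbb' and pair 1 covering `bbbb', the source text `[&] <\1> \&' expands to `[abbbb] <bbbb> &' under this sketch.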
                     84: .PP
                     85: .I Regerror
                     86: is called whenever an error is detected in \fIregcomp\fR, \fIregexec\fR,
                     87: or \fIregsub\fR.
                     88: The default \fIregerror\fR writes the string \fImsg\fR,
                     89: with a suitable indicator of origin,
                     90: on the standard
                     91: error output
                     92: and invokes \fIexit\fR(2).
                     93: .I Regerror
                     94: can be replaced by the user if other actions are desirable.
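A replacement regerror might look like the following sketch, which reports the message and returns instead of exiting, so that the caller can see the failure. The void return type and the last_regerror buffer are assumptions for illustration; the synopsis above declares regerror without a return type.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical buffer remembering the most recent message. */
static char last_regerror[128];

/*
 * User-supplied replacement: report the problem on standard error,
 * remember it, and return to the caller rather than exiting.
 */
void regerror(char *msg)
{
    assert(msg != NULL);
    fprintf(stderr, "regexp: %s\n", msg);
    strncpy(last_regerror, msg, sizeof last_regerror - 1);
    last_regerror[sizeof last_regerror - 1] = '\0';
}
```

Linking a definition like this ahead of the library's default is the usual way to override it. The caller can then inspect last_regerror after a failed call.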
                     95: .SH "REGULAR EXPRESSION SYNTAX"
                     96: A regular expression is zero or more \fIbranches\fR, separated by `|'.
                     97: It matches anything that matches one of the branches.
                     98: .PP
                     99: A branch is zero or more \fIpieces\fR, concatenated.
                    100: It matches a match for the first, followed by a match for the second, etc.
                    101: .PP
                    102: A piece is an \fIatom\fR possibly followed by `*', `+', or `?'.
                    103: An atom followed by `*' matches a sequence of 0 or more matches of the atom.
                    104: An atom followed by `+' matches a sequence of 1 or more matches of the atom.
                    105: An atom followed by `?' matches a match of the atom, or the null string.
                    106: .PP
                    107: An atom is a regular expression in parentheses (matching a match for the
                    108: regular expression), a \fIrange\fR (see below), `.'
                    109: (matching any single character), `^' (matching the null string at the
                    110: beginning of the input string), `$' (matching the null string at the
                    111: end of the input string), a `\e' followed by a single character (matching
                    112: that character), or a single character with no other significance
                    113: (matching that character).
                    114: .PP
                    115: A \fIrange\fR is a sequence of characters enclosed in `[]'.
                    116: It normally matches any single character from the sequence.
                    117: If the sequence begins with `^',
                    118: it matches any single character \fInot\fR from the rest of the sequence.
                    119: If two characters in the sequence are separated by `\-', this is shorthand
                    120: for the full list of ASCII characters between them
                    121: (e.g. `[0-9]' matches any decimal digit).
                    122: To include a literal `]' in the sequence, make it the first character
                    123: (following a possible `^').
                    124: To include a literal `\-', make it the first or last character.
                    125: .SH AMBIGUITY
                    126: If a regular expression could match two different parts of the input string,
                    127: it will match the one which begins earliest.
                    128: If both begin in the same place but match different lengths, or match
                    129: the same length in different ways, life gets messier, as follows.
                    130: .PP
                    131: In general, the possibilities in a list of branches are considered in
                    132: left-to-right order, the possibilities for `*', `+', and `?' are
                    133: considered longest-first, nested constructs are considered from the
                    134: outermost in, and concatenated constructs are considered leftmost-first.
                    135: The match that will be chosen is the one that uses the earliest
                    136: possibility in the first choice that has to be made.
                    137: If there is more than one choice, the next will be made in the same manner
                    138: (earliest possibility) subject to the decision on the first choice.
                    139: And so forth.
                    140: .PP
                    141: For example, `(ab|a)b*c' could match `abc' in one of two ways.
                    142: The first choice is between `ab' and `a'; since `ab' is earlier, and does
                    143: lead to a successful overall match, it is chosen.
                    144: Since the `b' is already spoken for,
                    145: the `b*' must match its last possibility\(emthe empty string\(emsince
                    146: it must respect the earlier choice.
                    147: .PP
                    148: In the particular case where no `|'s are present and there is only one
                    149: `*', `+', or `?', the net effect is that the longest possible
                    150: match will be chosen.
                    151: So `ab*', presented with `xabbbby', will match `abbbb'.
                    152: Note that if `ab*' is tried against `xabyabbbz', it
                    153: will match `ab' just after `x', due to the begins-earliest rule.
                    154: (In effect, the decision on where to start the match is the first choice
                    155: to be made, hence subsequent choices must respect it even if this leads them
                    156: to less-preferred alternatives.)
                    157: .SH DIAGNOSTICS
                    158: \fIRegcomp\fR returns NULL for a failure
                    159: (\fIregerror\fR permitting),
                    160: where failures are syntax errors, exceeding implementation limits,
                    161: or applying `+' or `*' to a possibly-null operand.
                    162: .SH HISTORY
                    163: Both code and manual page for \fIregcomp\fP, \fIregexec\fP, \fIregsub\fP,
                    164: and \fIregerror\fP were written at the University of Toronto.
                    165: They are intended to be compatible with the Bell V8 \fIregexp\fR(3),
                    166: but are not derived from Bell code.
                    167: .SH BUGS
                    168: Empty branches and empty regular expressions are not portable to V8.
                    169: .PP
                    170: The restriction against
                    171: applying `*' or `+' to a possibly-null operand is an artifact of the
                    172: simplistic implementation.
                    173: .PP
                    174: Does not support \fIegrep\fR's newline-separated branches;
                    175: neither does the V8 \fIregexp\fR(3), though.
                    176: .PP
                    177: Due to emphasis on
                    178: compactness and simplicity,
                    179: it's not strikingly fast.
                    180: It does give special attention to handling simple cases quickly.
                    181: .SH "SEE ALSO"
                    182: ed(1), ex(1), expr(1), egrep(1), fgrep(1), grep(1), regex(3)
