.TH REGEXP 3 local
.DA 30 Nov 1985
.SH NAME
regcomp, regexec, regsub, regerror \- regular expression handler
.SH SYNOPSIS
.ft B
.nf
#include <regexp.h>

regexp *regcomp(exp)
char *exp;

int regexec(prog, string)
regexp *prog;
char *string;

regsub(prog, source, dest)
regexp *prog;
char *source;
char *dest;

regerror(msg)
char *msg;
.SH DESCRIPTION
These functions implement
.IR egrep (1)-style
regular expressions and supporting facilities.
.PP
.I Regcomp
compiles a regular expression into a structure of type
.IR regexp ,
and returns a pointer to it.
The space has been allocated using
.IR malloc (3)
and may be released by
.IR free .
.PP
.I Regexec
matches a NUL-terminated \fIstring\fR against the compiled regular expression
in \fIprog\fR.
It returns 1 for success and 0 for failure, and adjusts the contents of
\fIprog\fR's \fIstartp\fR and \fIendp\fR (see below) accordingly.
.PP
The members of a
.I regexp
structure include at least the following (not necessarily in order):
.PP
.RS
char *startp[NSUBEXP];
.br
char *endp[NSUBEXP];
.RE
.PP
where
.I NSUBEXP
is defined (as 10) in the header file.
Once a successful \fIregexec\fR has been done using the \fIregexp\fR,
each \fIstartp\fR-\fIendp\fR pair describes one substring
within the \fIstring\fR,
with the \fIstartp\fR pointing to the first character of the substring and
the \fIendp\fR pointing to the first character following the substring.
The 0th substring is the substring of \fIstring\fR that matched the whole
regular expression.
The others are those substrings that matched parenthesized expressions
within the regular expression, with parenthesized expressions numbered
in left-to-right order of their opening parentheses.
.PP
.I Regsub
copies \fIsource\fR to \fIdest\fR, making substitutions according to the
most recent \fIregexec\fR performed using \fIprog\fR.
Each instance of `&' in \fIsource\fR is replaced by the substring
indicated by \fIstartp\fR[\fI0\fR] and
\fIendp\fR[\fI0\fR].
Each instance of `\e\fIn\fR', where \fIn\fR is a digit, is replaced by
the substring indicated by
\fIstartp\fR[\fIn\fR] and
\fIendp\fR[\fIn\fR].
To get a literal `&' or `\e\fIn\fR' into \fIdest\fR, prefix it with `\e';
to get a literal `\e' preceding `&' or `\e\fIn\fR', prefix it with
another `\e'.
.PP
.I Regerror
is called whenever an error is detected in \fIregcomp\fR, \fIregexec\fR,
or \fIregsub\fR.
The default \fIregerror\fR writes the string \fImsg\fR,
with a suitable indicator of origin,
on the standard
error output
and invokes \fIexit\fR(2).
.I Regerror
can be replaced by the user if other actions are desirable.
.SH "REGULAR EXPRESSION SYNTAX"
A regular expression is zero or more \fIbranches\fR, separated by `|'.
It matches anything that matches one of the branches.
.PP
A branch is zero or more \fIpieces\fR, concatenated.
It matches a match for the first, followed by a match for the second, etc.
.PP
A piece is an \fIatom\fR possibly followed by `*', `+', or `?'.
An atom followed by `*' matches a sequence of 0 or more matches of the atom.
An atom followed by `+' matches a sequence of 1 or more matches of the atom.
An atom followed by `?' matches a match of the atom, or the null string.
.PP
An atom is a regular expression in parentheses (matching a match for the
regular expression), a \fIrange\fR (see below), `.'
(matching any single character), `^' (matching the null string at the
beginning of the input string), `$' (matching the null string at the
end of the input string), a `\e' followed by a single character (matching
that character), or a single character with no other significance
(matching that character).
.PP
A \fIrange\fR is a sequence of characters enclosed in `[]'.
It normally matches any single character from the sequence.
If the sequence begins with `^',
it matches any single character \fInot\fR from the rest of the sequence.
If two characters in the sequence are separated by `\-', this is shorthand
for the full list of ASCII characters between them
(e.g. `[0-9]' matches any decimal digit).
To include a literal `]' in the sequence, make it the first character
(following a possible `^').
To include a literal `\-', make it the first or last character.
.SH AMBIGUITY
If a regular expression could match two different parts of the input string,
it will match the one which begins earliest.
If both begin in the same place but match different lengths, or match
the same length in different ways, life gets messier, as follows.
.PP
In general, the possibilities in a list of branches are considered in
left-to-right order, the possibilities for `*', `+', and `?' are
considered longest-first, nested constructs are considered from the
outermost in, and concatenated constructs are considered leftmost-first.
The match that will be chosen is the one that uses the earliest
possibility in the first choice that has to be made.
If there is more than one choice, the next will be made in the same manner
(earliest possibility) subject to the decision on the first choice.
And so forth.
.PP
For example, `(ab|a)b*c' could match `abc' in one of two ways.
The first choice is between `ab' and `a'; since `ab' is earlier, and does
lead to a successful overall match, it is chosen.
Since the `b' is already spoken for,
the `b*' must match its last possibility\(emthe empty string\(emsince
it must respect the earlier choice.
.PP
In the particular case where no `|'s are present and there is only one
`*', `+', or `?', the net effect is that the longest possible
match will be chosen.
So `ab*', presented with `xabbbby', will match `abbbb'.
Note that if `ab*' is tried against `xabyabbbz', it
will match `ab' just after `x', due to the begins-earliest rule.
(In effect, the decision on where to start the match is the first choice
to be made, hence subsequent choices must respect it even if this leads them
to less-preferred alternatives.)
.SH SEE ALSO
egrep(1), expr(1)
.SH DIAGNOSTICS
\fIRegcomp\fR returns NULL for a failure
(\fIregerror\fR permitting),
where failures are syntax errors, exceeding implementation limits,
or applying `+' or `*' to a possibly-null operand.
.SH HISTORY
Both code and manual page were
written at U of T.
They are intended to be compatible with the Bell V8 \fIregexp\fR(3),
but are not derived from Bell code.
.SH BUGS
Empty branches and empty regular expressions are not portable to V8.
.PP
The restriction against
applying `*' or `+' to a possibly-null operand is an artifact of the
simplistic implementation.
.PP
Does not support \fIegrep\fR's newline-separated branches;
neither does the V8 \fIregexp\fR(3), though.
.PP
Due to emphasis on
compactness and simplicity,
it's not strikingly fast.
It does give special attention to handling simple cases quickly.
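.SH EXAMPLE
The substitution rules described for \fIregsub\fR can be illustrated with a small stand-alone sketch.
It is not the library's code: the names \fImatch_sketch\fR and \fIsub_sketch\fR and the \fIdestsize\fR bound are assumptions made for illustration, and the structure only mimics the two documented members of a \fIregexp\fR.
The real \fIregsub\fR takes a \fIregexp\fR pointer and has no size argument.

```c
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10  /* the value this manual page gives for the header */

/* Hypothetical stand-in for the two documented members of a regexp. */
struct match_sketch {
    const char *startp[NSUBEXP];
    const char *endp[NSUBEXP];
};

/*
 * Copy source to dest, expanding `&' and `\n' from the recorded
 * substrings.  Returns 0 on success, or -1 if the result (including
 * its terminating NUL) does not fit in destsize bytes.
 */
int sub_sketch(const struct match_sketch *m, const char *source,
               char *dest, size_t destsize)
{
    size_t used = 0;
    const char *src = source;

    while (*src != '\0') {
        char c = *src++;
        int no = -1;                    /* -1 means "ordinary character" */

        if (c == '&') {
            no = 0;                     /* whole match */
        } else if (c == '\\' && *src >= '0' && *src <= '9') {
            no = *src++ - '0';          /* numbered substring */
        } else if (c == '\\' && (*src == '\\' || *src == '&')) {
            c = *src++;                 /* escaped: copy next char literally */
        }

        if (no < 0) {
            if (used + 1 >= destsize)
                return -1;
            dest[used++] = c;
        } else if (m->startp[no] != NULL && m->endp[no] != NULL) {
            size_t len = (size_t)(m->endp[no] - m->startp[no]);
            if (used + len >= destsize)
                return -1;
            memcpy(dest + used, m->startp[no], len);
            used += len;
        }
        /* A substring that was never set expands to nothing. */
    }
    if (used >= destsize)
        return -1;
    dest[used] = '\0';
    return 0;
}
```

With the pointers set as the AMBIGUITY example implies for `(ab|a)b*c' against `xabcy' (substring 0 is `abc', substring 1 is `ab'), a \fIsource\fR of `[&] <\e1>' would be copied as `[abc] <ab>'.
The explicit size bound is a design choice of this sketch, made so that an undersized \fIdest\fR is reported instead of overrun.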