|
|
1.1 ! root 1: .TH REGEXP 3 local ! 2: .DA 30 Nov 1985 ! 3: .SH NAME ! 4: regcomp, regexec, regsub, regerror \- regular expression handler ! 5: .SH SYNOPSIS ! 6: .ft B ! 7: .nf ! 8: #include <regexp.h> ! 9: ! 10: regexp *regcomp(exp) ! 11: char *exp; ! 12: ! 13: int regexec(prog, string) ! 14: regexp *prog; ! 15: char *string; ! 16: ! 17: regsub(prog, source, dest) ! 18: regexp *prog; ! 19: char *source; ! 20: char *dest; ! 21: ! 22: regerror(msg) ! 23: char *msg; ! 24: .SH DESCRIPTION ! 25: These functions implement ! 26: .IR egrep (1)-style ! 27: regular expressions and supporting facilities. ! 28: .PP ! 29: .I Regcomp ! 30: compiles a regular expression into a structure of type ! 31: .IR regexp , ! 32: and returns a pointer to it. ! 33: The space has been allocated using ! 34: .IR malloc (3) ! 35: and may be released by ! 36: .IR free . ! 37: .PP ! 38: .I Regexec ! 39: matches a NUL-terminated \fIstring\fR against the compiled regular expression ! 40: in \fIprog\fR. ! 41: It returns 1 for success and 0 for failure, and adjusts the contents of ! 42: \fIprog\fR's \fIstartp\fR and \fIendp\fR (see below) accordingly. ! 43: .PP ! 44: The members of a ! 45: .I regexp ! 46: structure include at least the following (not necessarily in order): ! 47: .PP ! 48: .RS ! 49: char *startp[NSUBEXP]; ! 50: .br ! 51: char *endp[NSUBEXP]; ! 52: .RE ! 53: .PP ! 54: where ! 55: .I NSUBEXP ! 56: is defined (as 10) in the header file. ! 57: Once a successful \fIregexec\fR has been done using the \fIregexp\fR, ! 58: each \fIstartp\fR-\fIendp\fR pair describes one substring ! 59: within the \fIstring\fR, ! 60: with the \fIstartp\fR pointing to the first character of the substring and ! 61: the \fIendp\fR pointing to the first character following the substring. ! 62: The 0th substring is the substring of \fIstring\fR that matched the whole ! 63: regular expression. ! 64: The others are those substrings that matched parenthesized expressions ! 65: within the regular expression, with parenthesized expressions numbered ! 66: in left-to-right order of their opening parentheses. ! 67: .PP ! 68: .I Regsub ! 69: copies \fIsource\fR to \fIdest\fR, making substitutions according to the ! 70: most recent \fIregexec\fR performed using \fIprog\fR. ! 71: Each instance of `&' in \fIsource\fR is replaced by the substring ! 72: indicated by \fIstartp\fR[\fI0\fR] and ! 73: \fIendp\fR[\fI0\fR]. ! 74: Each instance of `\e\fIn\fR', where \fIn\fR is a digit, is replaced by ! 75: the substring indicated by ! 76: \fIstartp\fR[\fIn\fR] and ! 77: \fIendp\fR[\fIn\fR]. ! 78: .PP ! 79: .I Regerror ! 80: is called whenever an error is detected in \fIregcomp\fR, \fIregexec\fR, ! 81: or \fIregsub\fR. ! 82: The default \fIregerror\fR writes the string \fImsg\fR, ! 83: with a suitable indicator of origin, ! 84: on the standard ! 85: error output ! 86: and invokes \fIexit\fR(2). ! 87: .I Regerror ! 88: can be replaced by the user if other actions are desirable. ! 89: .SH "REGULAR EXPRESSION SYNTAX" ! 90: A regular expression is zero or more \fIbranches\fR, separated by `|'. ! 91: It matches anything that matches one of the branches. ! 92: .PP ! 93: A branch is zero or more \fIpieces\fR, concatenated. ! 94: It matches a match for the first, followed by a match for the second, etc. ! 95: .PP ! 96: A piece is an \fIatom\fR possibly followed by `*', `+', or `?'. ! 97: An atom followed by `*' matches a sequence of 0 or more matches of the atom. ! 98: An atom followed by `+' matches a sequence of 1 or more matches of the atom. ! 99: An atom followed by `?' matches a match of the atom, or the null string. ! 100: .PP ! 101: An atom is a regular expression in parentheses (matching a match for the ! 102: regular expression), a \fIrange\fR (see below), `.' ! 103: (matching any single character), `^' (matching the null string at the ! 104: beginning of the input string), `$' (matching the null string at the ! 105: end of the input string), a `\e' followed by a single character (matching ! 106: that character), or a single character with no other significance ! 107: (matching that character). ! 108: .PP ! 109: A \fIrange\fR is a sequence of characters enclosed in `[]'. ! 110: It normally matches any single character from the sequence. ! 111: If the sequence begins with `^', ! 112: it matches any single character \fInot\fR from the rest of the sequence. ! 113: If two characters in the sequence are separated by `\-', this is shorthand ! 114: for the full list of ASCII characters between them ! 115: (e.g. `[0-9]' matches any decimal digit). ! 116: To include a literal `]' in the sequence, make it the first character ! 117: (following a possible `^'). ! 118: To include a literal `\-', make it the first or last character. ! 119: .SH AMBIGUITY ! 120: If a regular expression could match two different parts of the input string, ! 121: it will match the one which begins earliest. ! 122: If both begin in the same place but match different lengths, or match ! 123: the same length in different ways, life gets messier, as follows. ! 124: .PP ! 125: In general, the possibilities in a list of branches are considered in ! 126: left-to-right order, the possibilities for `*', `+', and `?' are ! 127: considered longest-first, nested constructs are considered from the ! 128: outermost in, and concatenated constructs are considered leftmost-first. ! 129: The match that will be chosen is the one that uses the earliest ! 130: possibility in the first choice that has to be made. ! 131: If there is more than one choice, the next will be made in the same manner ! 132: (earliest possibility) subject to the decision on the first choice. ! 133: And so forth. ! 134: .PP ! 135: For example, `(ab|a)b*c' could match `abc' in one of two ways. ! 136: The first choice is between `ab' and `a'; since `ab' is earlier, and does ! 137: lead to a successful overall match, it is chosen. ! 138: Since the `b' is already spoken for, ! 139: the `b*' must match its last possibility\(emthe empty string\(emsince ! 140: it must respect the earlier choice. ! 141: .PP ! 142: In the particular case where no `|'s are present and there is only one ! 143: `*', `+', or `?', the net effect is that the longest possible ! 144: match will be chosen. ! 145: So `ab*', presented with `xabbbby', will match `abbbb'. ! 146: Note that if `ab*' is tried against `xabyabbbz', it ! 147: will match `ab' just after `x', due to the begins-earliest rule. ! 148: (In effect, the decision on where to start the match is the first choice ! 149: to be made, hence subsequent choices must respect it even if this leads them ! 150: to less-preferred alternatives.) ! 151: .SH SEE ALSO ! 152: egrep(1), expr(1) ! 153: .SH DIAGNOSTICS ! 154: \fIRegcomp\fR returns NULL for a failure ! 155: (\fIregerror\fR permitting), ! 156: where failures are syntax errors, exceeding implementation limits, ! 157: or applying `+' or `*' to a possibly-null operand. ! 158: .SH HISTORY ! 159: Both code and manual page were ! 160: written at U of T. ! 161: They are intended to be compatible with the Bell V8 \fIregexp\fR(3), ! 162: but are not derived from Bell code. ! 163: .SH BUGS ! 164: Empty branches and empty regular expressions are not portable to V8. ! 165: .PP ! 166: The restriction against ! 167: applying `*' or `+' to a possibly-null operand is an artifact of the ! 168: simplistic implementation. ! 169: .PP ! 170: Does not support \fIegrep\fR's newline-separated branches; ! 171: neither does the V8 \fIregexp\fR(3), though. ! 172: .PP ! 173: Due to emphasis on ! 174: compactness and simplicity, ! 175: it's not strikingly fast. ! 176: It does give special attention to handling simple cases quickly.
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.