.\"
.\" @(#)regexp.3 5.1 (Berkeley) 5/19/88
.\"
.TH REGEXP 3 "May 19, 1988"
.UC
.SH NAME
regcomp, regexec, regsub, regerror \- regular expression handlers
.SH SYNOPSIS
.nf
.B #include <regexp.h>
.PP
.B regexp *regcomp(exp)
.B char *exp;
.PP
.B int regexec(prog, string)
.B regexp *prog;
.B char *string;
.PP
.B regsub(prog, source, dest)
.B regexp *prog;
.B char *source;
.B char *dest;
.PP
.B regerror(msg)
.B char *msg;
.fi
.SH DESCRIPTION
\fIRegcomp\fP, \fIregexec\fP, \fIregsub\fP, and \fIregerror\fP implement
.IR egrep (1)-style
regular expressions and supporting facilities.
.PP
.I Regcomp
compiles a regular expression into a structure of type
.IR regexp ,
and returns a pointer to it.
The space has been allocated using
.IR malloc (3)
and may be released by
.IR free .
.PP
.I Regexec
matches a NUL-terminated \fIstring\fR against the compiled regular expression
in \fIprog\fR.
It returns 1 for success and 0 for failure, and adjusts the contents of
\fIprog\fR's \fIstartp\fR and \fIendp\fR (see below) accordingly.
.PP
The members of a
.I regexp
structure include at least the following (not necessarily in order):
.PP
.RS
char *startp[NSUBEXP];
.br
char *endp[NSUBEXP];
.RE
.PP
where
.I NSUBEXP
is defined (as 10) in the header file.
Once a successful \fIregexec\fR has been done using the \fIregexp\fR,
each \fIstartp\fR-\fIendp\fR pair describes one substring
within the \fIstring\fR,
with the \fIstartp\fR pointing to the first character of the substring and
the \fIendp\fR pointing to the first character following the substring.
The 0th substring is the substring of \fIstring\fR that matched the whole
regular expression.
The others are those substrings that matched parenthesized expressions
within the regular expression, with parenthesized expressions numbered
in left-to-right order of their opening parentheses.
.PP
.I Regsub
copies \fIsource\fR to \fIdest\fR, making substitutions according to the
most recent \fIregexec\fR performed using \fIprog\fR.
Each instance of `&' in \fIsource\fR is replaced by the substring
indicated by \fIstartp\fR[\fI0\fR] and
\fIendp\fR[\fI0\fR].
Each instance of `\e\fIn\fR', where \fIn\fR is a digit, is replaced by
the substring indicated by
\fIstartp\fR[\fIn\fR] and
\fIendp\fR[\fIn\fR].
To get a literal `&' or `\e\fIn\fR' into \fIdest\fR, prefix it with `\e';
to get a literal `\e' preceding `&' or `\e\fIn\fR', prefix it with
another `\e'.
.PP
.I Regerror
is called whenever an error is detected in \fIregcomp\fR, \fIregexec\fR,
or \fIregsub\fR.
The default \fIregerror\fR writes the string \fImsg\fR,
with a suitable indicator of origin,
on the standard
error output
and invokes \fIexit\fR(2).
.I Regerror
can be replaced by the user if other actions are desirable.
.SH "REGULAR EXPRESSION SYNTAX"
A regular expression is zero or more \fIbranches\fR, separated by `|'.
It matches anything that matches one of the branches.
.PP
A branch is zero or more \fIpieces\fR, concatenated.
It matches a match for the first, followed by a match for the second, etc.
.PP
A piece is an \fIatom\fR possibly followed by `*', `+', or `?'.
An atom followed by `*' matches a sequence of 0 or more matches of the atom.
An atom followed by `+' matches a sequence of 1 or more matches of the atom.
An atom followed by `?'
matches a match of the atom, or the null string.
.PP
An atom is a regular expression in parentheses (matching a match for the
regular expression), a \fIrange\fR (see below), `.'
(matching any single character), `^' (matching the null string at the
beginning of the input string), `$' (matching the null string at the
end of the input string), a `\e' followed by a single character (matching
that character), or a single character with no other significance
(matching that character).
.PP
A \fIrange\fR is a sequence of characters enclosed in `[]'.
It normally matches any single character from the sequence.
If the sequence begins with `^',
it matches any single character \fInot\fR from the rest of the sequence.
If two characters in the sequence are separated by `\-', this is shorthand
for the full list of ASCII characters between them
(e.g. `[0-9]' matches any decimal digit).
To include a literal `]' in the sequence, make it the first character
(following a possible `^').
To include a literal `\-', make it the first or last character.
.SH AMBIGUITY
If a regular expression could match two different parts of the input string,
it will match the one which begins earliest.
If both begin in the same place but match different lengths, or match
the same length in different ways, life gets messier, as follows.
.PP
In general, the possibilities in a list of branches are considered in
left-to-right order, the possibilities for `*', `+', and `?' are
considered longest-first, nested constructs are considered from the
outermost in, and concatenated constructs are considered leftmost-first.
The match that will be chosen is the one that uses the earliest
possibility in the first choice that has to be made.
If there is more than one choice, the next will be made in the same manner
(earliest possibility) subject to the decision on the first choice.
And so forth.
.PP
For example, `(ab|a)b*c' could match `abc' in one of two ways.
The first choice is between `ab' and `a'; since `ab' is earlier, and does
lead to a successful overall match, it is chosen.
Since the `b' is already spoken for,
the `b*' must match its last possibility\(emthe empty string\(emsince
it must respect the earlier choice.
.PP
In the particular case where no `|'s are present and there is only one
`*', `+', or `?', the net effect is that the longest possible
match will be chosen.
So `ab*', presented with `xabbbby', will match `abbbb'.
Note that if `ab*' is tried against `xabyabbbz', it
will match `ab' just after `x', due to the begins-earliest rule.
(In effect, the decision on where to start the match is the first choice
to be made, hence subsequent choices must respect it even if this leads them
to less-preferred alternatives.)
.SH DIAGNOSTICS
\fIRegcomp\fR returns NULL for a failure
(\fIregerror\fR permitting),
where failures are syntax errors, exceeding implementation limits,
or applying `+' or `*' to a possibly-null operand.
.SH HISTORY
Both code and manual page for \fIregcomp\fP, \fIregexec\fP, \fIregsub\fP,
and \fIregerror\fP were written at the University of Toronto.
They are intended to be compatible with the Bell V8 \fIregexp\fR(3),
but are not derived from Bell code.
.SH BUGS
Empty branches and empty regular expressions are not portable to V8.
.PP
The restriction against
applying `*' or `+' to a possibly-null operand is an artifact of the
simplistic implementation.
.PP
Does not support \fIegrep\fR's newline-separated branches;
neither does the V8 \fIregexp\fR(3), though.
.PP
Due to emphasis on
compactness and simplicity,
it's not strikingly fast.
It does give special attention to handling simple cases quickly.
.SH "SEE ALSO"
ed(1), ex(1), expr(1), egrep(1), fgrep(1), grep(1), regex(3)
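.SH EXAMPLE
The begins-earliest and longest-match behavior described under AMBIGUITY
can be tried out with a minimal sketch along the following lines.
It rests on two assumptions that are not part of this manual page:
it uses the POSIX <regex.h> interface with REG_EXTENDED rather than the
<regexp.h> routines documented here, so that it builds on systems where
this library is not installed, and \fIfirst_match\fR is a hypothetical
helper invented for the illustration.
POSIX matching picks the leftmost and then longest overall match, which
agrees with the rules above for a pattern with a single `*' and no `|',
but may differ for other patterns.
.PP
.nf
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of regexp.h or regex.h).
 * Copies the text matched by pattern in subject into out.
 * Returns 1 on a match, 0 if nothing matched, and -1 if the
 * pattern does not compile, regexec fails, or out is too small.
 * Assumption: POSIX extended syntax stands in for egrep style.
 */
int
first_match(const char *pattern, const char *subject,
            char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[1];    /* m[0] describes the whole match */
    size_t len;
    int rc;

    assert(pattern != NULL && subject != NULL);
    assert(out != NULL && outsz > 0);
    out[0] = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, subject, 1, m, 0);
    regfree(&re);
    if (rc == REG_NOMATCH)
        return 0;
    if (rc != 0)
        return -1;

    /* rm_so and rm_eo are byte offsets into subject. */
    len = (size_t)(m[0].rm_eo - m[0].rm_so);
    if (len >= outsz)
        return -1;
    memcpy(out, subject + m[0].rm_so, len);
    out[len] = 0;
    return 1;
}
```
.fi
.PP
Under these assumptions, calling \fIfirst_match\fR with the pattern `ab*'
and the subject `xabyabbbz' should leave `ab' in \fIout\fR, and with the
subject `xabbbby' it should leave `abbbb', matching the two cases worked
through under AMBIGUITY.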