|
|
1.1 root 1: Info file regex, produced by Makeinfo, -*- Text -*- from input file
2: regex.texinfo.
3:
4:
5:
6: File: regex, Node: top, Next: syntax, Up: (dir)
7:
8: "regex" regular expression matching library.
9: ********************************************
10:
11: Overview
12: ========
13:
14: Regular expression matching allows you to test whether a string fits
15: into a specific syntactic shape. You can also search a string for a
16: substring that fits a pattern.
17:
18: A regular expression describes a set of strings. The simplest case
19: is one that describes a particular string; for example, the string
20: `foo' when regarded as a regular expression matches `foo' and nothing
21: else. Nontrivial regular expressions use certain special constructs
22: so that they can match more than one string. For example, the
23: regular expression `foo\|bar' matches either the string `foo' or the
24: string `bar'; the regular expression `c[ad]*r' matches any of the
25: strings `cr', `car', `cdr', `caar', `cadddar' and all other such
26: strings with any number of `a''s and `d''s.
27:
28: The first step in matching a regular expression is to compile it.
29: You must supply the pattern string and also a pattern buffer to hold
30: the compiled result. That result contains the pattern in an internal
31: format that is easier to use in matching.
32:
33: Having compiled a pattern, you can match it against strings. You can
34: match the compiled pattern any number of times against different
35: strings.
36:
37: * Menu:
38:
39: * syntax:: Syntax of regular expressions
40: * directives:: Meaning of characters as regex string directives.
41: * emacs:: Additional character directives available
42: only for use within Emacs.
43: * programming:: Using the regex library from C programs
44: * unix:: Unix-compatible entry-points to regex library
45:
46:
47:
48: File: regex, Node: syntax, Next: directives, Prev: top, Up: top
49:
50: Syntax of Regular Expressions
51: =============================
52:
53: Regular expressions have a syntax in which a few characters are
54: special constructs and the rest are "ordinary". An ordinary
55: character is a simple regular expression which matches that character
56: and nothing else. The special characters are `$', `^', `.', `*',
57: `+', `?', `[', `]' and `\'. Any other character appearing in a
58: regular expression is ordinary, unless a `\' precedes it.
59:
60: For example, `f' is not a special character, so it is ordinary, and
61: therefore `f' is a regular expression that matches the string `f' and
62: no other string. (It does *not* match the string `ff'.) Likewise,
63: `o' is a regular expression that matches only `o'.
64:
65: Any two regular expressions A and B can be concatenated. The result
66: is a regular expression which matches a string if A matches some
67: amount of the beginning of that string and B matches the rest of the
68: string.
69:
70: As a simple example, we can concatenate the regular expressions `f'
71: and `o' to get the regular expression `fo', which matches only the
72: string `fo'. Still trivial.
73:
74: Note: for Unix compatibility, special characters are treated as
75: ordinary ones if they are in contexts where their special meanings
76: make no sense. For example, `*foo' treats `*' as ordinary since
77: there is no preceding expression on which the `*' can act. It is
78: poor practice to depend on this behavior; better to quote the special
79: character anyway, regardless of where is appears.
80:
81:
82:
83: File: regex, Node: directives, Next: emacs, Prev: syntax, Up: top
84:
85: The following are the characters and character sequences which have
86: special meaning within regular expressions. Any character not
87: mentioned here is not special; it stands for exactly itself for the
88: purposes of searching and matching. *Note syntax::.
89:
90: `.'
91: is a special character that matches anything except a newline.
92: Using concatenation, we can make regular expressions like `a.b'
93: which matches any three-character string which begins with `a'
94: and ends with `b'.
95:
96: `*'
97: is not a construct by itself; it is a suffix, which means the
98: preceding regular expression is to be repeated as many times as
99: possible. In `fo*', the `*' applies to the `o', so `fo*'
100: matches `f' followed by any number of `o''s.
101:
102: The case of zero `o''s is allowed: `fo*' does match `f'.
103:
104: `*' always applies to the *smallest* possible preceding
105: expression. Thus, `fo*' has a repeating `o', not a repeating
106: `fo'.
107:
108: The matcher processes a `*' construct by matching, immediately,
109: as many repetitions as can be found. Then it continues with the
110: rest of the pattern. If that fails, backtracking occurs,
111: discarding some of the matches of the `*''d construct in case
112: that makes it possible to match the rest of the pattern. For
113: example, matching `c[ad]*ar' against the string `caddaar', the
114: `[ad]*' first matches `addaa', but this does not allow the next
115: `a' in the pattern to match. So the last of the matches of
116: `[ad]' is undone and the following `a' is tried again. Now it
117: succeeds.
118:
119: `+'
120: `+' is like `*' except that at least one match for the preceding
121: pattern is required for `+'. Thus, `c[ad]+r' does not match
122: `cr' but does match anything else that `c[ad]*r' would match.
123:
124: `?'
125: `?' is like `*' except that it allows either zero or one match
126: for the preceding pattern. Thus, `c[ad]?r' matches `cr' or
127: `car' or `cdr', and nothing else.
128:
129: `[ ... ]'
130: `[' begins a "character set", which is terminated by a `]'. In
131: the simplest case, the characters between the two form the set.
132: Thus, `[ad]' matches either `a' or `d', and `[ad]*' matches any
133: string of `a''s and `d''s (including the empty string), from
134: which it follows that `c[ad]*r' matches `car', etc.
135:
136: Character ranges can also be included in a character set, by
137: writing two characters with a `-' between them. Thus, `[a-z]'
138: matches any lower-case letter. Ranges may be intermixed freely
139: with individual characters, as in `[a-z$%.]', which matches any
140: lower case letter or `$', `%' or period.
141:
142: Note that the usual special characters are not special any more
143: inside a character set. A completely different set of special
144: characters exists inside character sets: `]', `-' and `^'.
145:
146: To include a `]' in a character set, you must make it the first
147: character. For example, `[]a]' matches `]' or `a'. To include
148: a `-', you must use it in a context where it cannot possibly
149: indicate a range: that is, as the first character, or
150: immediately after a range.
151:
152: `[^ ... ]'
153: `[^' begins a "complement character set", which matches any
154: character except the ones specified. Thus, `[^a-z0-9A-Z]'
155: matches all characters *except* letters and digits.
156:
157: `^' is not special in a character set unless it is the first
158: character. The character following the `^' is treated as if it
159: were first (it may be a `-' or a `]').
160:
161: `^'
162: is a special character that matches the empty string -- but only
163: if at the beginning of a line in the text being matched.
164: Otherwise it fails to match anything. Thus, `^foo' matches a
165: `foo' which occurs at the beginning of a line.
166:
167: `$'
168: is similar to `^' but matches only at the end of a line. Thus,
169: `xx*$' matches a string of one or more `x''s at the end of a line.
170:
171: `\'
172: has two functions: it quotes the above special characters
173: (including `\'), and it introduces additional special constructs.
174:
175: Because `\' quotes special characters, `\$' is a regular
176: expression which matches only `$', and `\[' is a regular
177: expression which matches only `[', and so on.
178:
179: For the most part, `\' followed by any character matches only
180: that character. However, there are several exceptions:
181: characters which, when preceded by `\', are special constructs.
182: Such characters are always ordinary when encountered on their own.
183:
184: No new special characters will ever be defined. All extensions
185: to the regular expression syntax are made by defining new
186: two-character constructs that begin with `\'.
187:
188: `\|'
189: specifies an alternative. Two regular expressions A and B with
190: `\|' in between form an expression that matches anything that
191: either A or B will match.
192:
193: Thus, `foo\|bar' matches either `foo' or `bar' but no other
194: string.
195:
196: `\|' applies to the largest possible surrounding expressions.
197: Only a surrounding `\( ... \)' grouping can limit the grouping
198: power of `\|'.
199:
200: Full backtracking capability exists when multiple `\|''s are used.
201:
202: `\( ... \)'
203: is a grouping construct that serves three purposes:
204:
205: 1. To enclose a set of `\|' alternatives for other operations.
206: Thus, `\(foo\|bar\)x' matches either `foox' or `barx'.
207:
208: 2. To enclose a complicated expression for the postfix `*' to
209: operate on. Thus, `ba\(na\)*' matches `bananana', etc.,
210: with any (zero or more) number of `na''s.
211:
212: 3. To mark a matched substring for future reference.
213:
214: This last application is not a consequence of the idea of a
215: parenthetical grouping; it is a separate feature which happens
216: to be assigned as a second meaning to the same `\( ... \)'
217: construct because there is no conflict in practice between the
218: two meanings. Here is an explanation of this feature:
219:
220: `\DIGIT'
221: After the end of a `\( ... \)' construct, the matcher remembers
222: the beginning and end of the text matched by that construct.
223: Then, later on in the regular expression, you can use `\'
224: followed by DIGIT to mean "match the same text matched the
225: DIGIT'th time by the `\( ... \)' construct." The `\( ... \)'
226: constructs are numbered in order of commencement in the regexp.
227:
228: The strings matching the first nine `\( ... \)' constructs
229: appearing in a regular expression are assigned numbers 1 through
230: 9 in order of their beginnings. `\1' through `\9' may be used
231: to refer to the text matched by the corresponding `\( ... \)'
232: construct.
233:
234: For example, `\(.*\)\1' matches any string that is composed of
235: two identical halves. The `\(.*\)' matches the first half,
236: which may be anything, but the `\1' that follows must match the
237: same exact text.
238:
239: `\b'
240: matches the empty string, but only if it is at the beginning or
241: end of a word. Thus, `\bfoo\b' matches any occurrence of `foo'
242: as a separate word. `\bball\(s\|\)\b' matches `ball' or `balls'
243: as a separate word.
244:
245: `\B'
246: matches the empty string, provided it is *not* at the beginning
247: or end of a word.
248:
249: `\<'
250: matches the empty string, but only if it is at the beginning of
251: a word.
252:
253: `\>'
254: matches the empty string, but only if it is at the end of a word.
255:
256: `\w'
257: matches any word-constituent character.
258:
259: `\W'
260: matches any character that is not a word-constituent.
261:
262: There are a number of additional `\' regexp directives available for
263: use within Emacs only.
264:
265: (*note emacs::.).
266:
267:
268:
269: File: regex, Node: emacs, Next: programming, Prev: directives, Up: top
270:
271: Constructs Available in Emacs Only
272: ----------------------------------
273:
274: `\`'
275: matches the empty string, but only if it is at the beginning of
276: the buffer.
277:
278: `\''
279: matches the empty string, but only if it is at the end of the
280: buffer.
281:
282: `\sCODE'
283: matches any character whose syntax is CODE. CODE is a letter
284: which represents a syntax code: thus, `w' for word constituent,
285: `-' for whitespace, `(' for open-parenthesis, etc. See the
286: documentation for the Emacs function `modify-syntax-entry' for
287: further details.
288:
289: Thus, `\s(' matches any character with open-parenthesis syntax.
290:
291: `\SCODE'
292: matches any character whose syntax is not CODE.
293:
294:
295:
296: File: regex, Node: programming, Next: compiling, Prev: emacs, Up: top
297:
298: Programming using the `regex' library
299: =====================================
300:
301: The subnodes accessible from this menu give information on entry
302: points and data structures which C programs need to interface to the
303: `regex' library.
304:
305: * Menu:
306:
307: * compiling:: How to compile regular expressions
308: * matching:: Matching compiled regular expressions
309: * searching:: Searching for compiled regular expressions
310: * translation:: Translating characters into other characters
311: (for both compilation and matching)
312: * registers:: determining what was matched
313: * split:: matching data which is split into two pieces
314: * unix:: Unix-compatible entry-points to regex library
315:
316:
317:
318: File: regex, Node: compiling, Next: matching, Prev: programming, Up: programming
319:
320: Compiling a Regular Expression
321: ------------------------------
322:
323: To compile a regular expression, you must supply a pattern buffer.
324: This is a structure defined, in the include file `regex.h', as
325: follows:
326:
327: struct re_pattern_buffer
328: {
329: char *buffer /* Space holding the compiled pattern commands. */
330: int allocated /* Size of space that buffer points to */
331: int used /* Length of portion of buffer actually occupied */
332: char *fastmap; /* Pointer to fastmap, if any, or zero if none. */
333: /* re_search uses the fastmap, if there is one,
334: to skip quickly over totally implausible
335: characters */
336: char *translate;
337: /* Translate table to apply to characters before
338: comparing, or zero for no translation.
339: The translation is applied to a pattern when
340: it is compiled and to data when it is matched. */
341: char fastmap_accurate;
342: /* Set to zero when a new pattern is stored,
343: set to one when the fastmap is updated from it. */
344: };
345:
346: Before compiling a pattern, you must initialize the `buffer' field to
347: point to a block of memory obtained with `malloc', and the
348: `allocated' field to the size of that block, in bytes. The pattern
349: compiler will replace this block with a larger one if necessary.
350:
351: You must also initialize the `translate' field to point to the
352: translate table that you will use when you match the compiled
353: pattern, or to zero if you will use no translate table when you
354: match. *Note translation::.
355:
356: Then call `re_compile_pattern' to compile a regular expression into
357: the buffer:
358:
359: re_compile_pattern (REGEX, REGEX_SIZE, BUF)
360:
361: REGEX is the address of the regular expression (`char *'), REGEX_SIZE
362: is its length (`int'), BUF is the address of the buffer (`struct
363: re_pattern_buffer *').
364:
365: `re_compile_pattern' returns zero if it succeeds in compiling the
366: regular expression. In that case, `*buf' now contains the results.
367: Otherwise, `re_compile_pattern' returns a string which serves as an
368: error message.
369:
370: After compiling, if you wish to search for the pattern, you must
371: initialize the `fastmap' component of the pattern buffer. *Note
372: searching::.
373:
374:
375:
376: File: regex, Node: matching, Next: searching, Prev: compiling, Up: programming
377:
378: Matching a Compiled Pattern
379: ---------------------------
380:
381: Once a regular expression has been compiled into a pattern buffer,
382: you can match the pattern buffer against a string with `re_match'.
383:
384: re_match (BUF, STRING, SIZE, POS, REGS)
385:
386: BUF is, once again, the address of the buffer (`struct
387: re_pattern_buffer *'). STRING is the string to be matched (`char *').
388: sIZE is the length of that string (`int'). POS is the position
389: within the string at which to begin matching (`int'). The beginning
390: of the string is position 0. REGS is described below. Normally it
391: is zero. *Note registers::.
392:
393: `re_match' returns `-1' if the pattern does not match; otherwise, it
394: returns the length of the portion of `string' which was matched.
395:
396: For example, suppose that BUF points to a buffer containing the
397: result of compiling `x*', STRING points to `xxxxxy', and SIZE is `6'.
398: Suppose that POS is `2'. Then the last three `x''s will be matched,
399: so `re_match' will return `3'. If POS is zero, the value will be `5'.
400: If POS is `5' or `6', the value will be zero, meaning that the null
401: string was successfully matched. Note that since `x*' matches the
402: empty string, it will never entirely fail.
403:
404: It is up to the caller to avoid passing a value of POS that results
405: in matching outside the specified string. POS must not be negative
406: and must not be greater than SIZE.
407:
408:
409:
410: File: regex, Node: searching, Next: translation, Prev: matching, Up: programming
411:
412: Searching for a Match
413: ---------------------
414:
415: Searching means trying successive starting positions for a match
416: until a match is found. To search, you supply a compiled pattern
417: buffer. Before searching you must initialize the `fastmap' field of
418: the pattern buffer (see below).
419:
420: re_search (BUF, STRING, SIZE, STARTPOS, RANGE, REGS)
421:
422: is called like `re_match' except that the POS argument is replaced by
423: two arguments STARTPOS and RANGE. `re_search' tests for a match
424: starting at index STARTPOS, then at `STARTPOS + 1', and so on. It
425: tries RANGE consecutive positions before giving up and returning
426: `-1'. If a match is found, `re_search' returns the index at which
427: the match was found.
428:
429: If RANGE is negative, RE_SEARCH tries starting positions STARTPOS,
430: `STARTPOS - 1', ... in that order. `|RANGE|' is the number of tries
431: made.
432:
433: It is up to the caller to avoid passing value of STARTPOS and RANGE
434: that result in matching outside the specified string. STARTPOS must
435: be between zero and SIZE, inclusive, and so must `STARTPOS + RANGE -
436: 1' (if RANGE is positive) or `STARTPOS + RANGE + 1' (if RANGE is
437: negative).
438:
439: If you may be searching over a long distance (that is, trying many
440: different match starting points) with a compiled pattern, you should
441: use a "fastmap" in it. This is a block of 256 bytes, whose address
442: is placed in the `fastmap' component of the pattern buffer. The
443: first time you search for a particular compiled pattern, the fastmap
444: is set so that `FASTMAP[CH]' is nonzero if the character CH might
445: possibly start a match for this pattern. `re_search' checks each
446: character against the fastmap so that it can skip more quickly over
447: non-matches.
448:
449: If you do not want a fastmap, store zero in the `fastmap' component
450: of the pattern buffer before calling `re_search'.
451:
452: In either case, you must initialize this component in a pattern
453: buffer before you can use that buffer in a search; but you can choose
454: as an initial value either zero or the address of a suitable block of
455: memory.
456:
457: If you compile a new pattern in an existing pattern buffer, it is not
458: necessary to reinitialize the `fastmap' component (unless you wish to
459: override your previous choice).
460:
461:
462:
463: File: regex, Node: translation, Next: registers, Prev: searching, Up: programming
464:
465: Translate Tables
466: ----------------
467:
468: With a translate table, you can apply a transformation to all
469: characters before they are compared. For example, a table that maps
470: lower case letters into upper case (or vice versa) causes differences
471: in case to be ignored by matching.
472:
473: A translate table is a block of 256 bytes. Each character of raw
474: data is used as an index in the translate table. The value found
475: there is used instead of the original character. Each character in a
476: regular expression, except for the syntactic constructs, is
477: translated when the expression is compiled. Each character of a
478: string being matched is translated whenever it is compared or tested.
479:
480: A suitable translate table to ignore differences in case maps all
481: characters into themselves, except for lower case letters, which are
482: mapped into the corresponding upper case letters. It could be
483: initialized by:
484:
485: for (i = 0; i < 0400; i++)
486: table[i] = i;
487: for (i = 'a'; i <= 'z'; i++)
488: table[i] = i - 040;
489:
490: You specify the use of a translate table by putting its address in
491: the TRANSLATE component of the compiled pattern buffer. If this
492: component is zero, no translation is done. Since both compilation
493: and matching use the translate table, you must use the same table
494: contents for both operations or confusing things will happen.
495:
496:
497:
498: File: regex, Node: registers, Next: split, Prev: translation, Up: programming
499:
500: Registers: or "What Did the `\( ... \)' Groupings Actually Match?"
501: ------------------------------------------------------------------
502:
503: If you want to find out, after the match, what each of the first nine
504: `\( ... \)' groupings actually matched, you can pass the REGS
505: argument to the match or search function. Pass the address of a
506: structure of this type:
507:
508: struct re_registers
509: {
510: int start[RE_NREGS];
511: int end[RE_NREGS];
512: };
513:
514: `re_match' and `re_search' will store into this structure the data
515: you want. `REGS->start[REG]' will be the index in STRING of the
516: beginning of the data matched by the REG'th `\( ... \)' grouping, and
517: `REGS->end[REG]' will be the index of the end of that data (the index
518: of the first character beyond those matched). The values in the
519: start and end arrays at indexes greater than the number of `\( ...
520: \)' groupings present in the regular expression will be set to the
521: value -1. Register numbers start at 1 and run to `RE_NREGS - 1'
522: (normally `9'). `REGS->start[0]' and `REGS->end[0]' are similar but
523: describe the extent of the substring matched by the entire pattern.
524:
525: Both `struct re_registers' and `RE_NREGS' are defined in `regex.h'.
526:
527:
528:
529: File: regex, Node: split, Next: unix, Prev: registers, Up: programming
530:
531: Matching against Split Data
532: ---------------------------
533:
534: The functions `re_match_2' and `re_search_2' allow one to match in or
535: search data which is divided into two strings.
536:
537: `re_match_2' works like `re_match' except that two data strings and
538: sizes must be given.
539:
540: re_match_2 (BUF, STRING1, SIZE1, STRING2, SIZE2, POS, REGS)
541:
542: The matcher regards the contents of STRING1 as effectively followed
543: by the contents of STRING2, and matches the combined string against
544: the pattern in BUF.
545:
546: `re_search_2' is likewise similar to `re_search':
547:
548: re_search_2 (BUF, STRING1, SIZE1, STRING2, SIZE2, STARTPOS, RANGE, REGS)
549:
550: The value returned by RE_SEARCH_2 is an index into the combined data
551: made up of STRING1 and STRING2. It never exceeds `SIZE1 + SIZE2'.
552: The values returned in the REGS structure (if there is one) are
553: likewise indices in the combined data.
554:
555:
556:
557: File: regex, Node: unix, Prev: split, Up: programming
558:
559: Unix-Compatible Entry Points
560: ----------------------------
561:
562: The standard Berkeley Unix way to compile a regular expression is to
563: call `re_comp'. This function takes a single argument, the address
564: of the regular expression, which is assumed to be terminated by a
565: null character.
566:
567: `re_comp' does not ask you to specify a pattern buffer because it has
568: its own pattern buffer -- just one. Using `re_comp', one may match
569: only the most recently compiled regular expression.
570:
571: The value of `re_comp' is zero for success or else an error message
572: string, as for `re_compile_pattern'.
573:
574: Calling `re_comp' with the null string as argument it has no effect;
575: the contents of the buffer remain unchanged.
576:
577: The standard Berkeley Unix way to match the last regular expression
578: compiled is to call `re_exec'. This takes a single argument, the
579: address of the string to be matched. This string is assumed to be
580: terminated by a null character. Matching is tried starting at each
581: position in the string. `re_exec' returns `1' for success or `0' for
582: failure. One cannot find out how long a substring was matched, nor
583: what the `\( ... \)' groupings matched.
584:
585:
586:
587: Tag Table:
588: Node: top85
589: Node: syntax1706
590: Node: directives3241
591: Node: emacs10891
592: Node: programming11653
593: Node: compiling12381
594: Node: matching14827
595: Node: searching16269
596: Node: translation18534
597: Node: registers19952
598: Node: split21245
599: Node: unix22183
600:
601: End Tag Table
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.