|
|
1.1 root 1: .TH FLEX 1 "26 May 1990" "Version 2.3"
2: .SH NAME
3: flex - fast lexical analyzer generator
4: .SH SYNOPSIS
5: .B flex
6: .B [-bcdfinpstvFILT8 -C[efmF] -Sskeleton]
7: .I [filename ...]
8: .SH DESCRIPTION
9: .I flex
10: is a tool for generating
11: .I scanners:
12: programs which recognized lexical patterns in text.
13: .I flex
14: reads
15: the given input files, or its standard input if no file names are given,
16: for a description of a scanner to generate. The description is in
17: the form of pairs
18: of regular expressions and C code, called
19: .I rules. flex
20: generates as output a C source file,
21: .B lex.yy.c,
22: which defines a routine
23: .B yylex().
24: This file is compiled and linked with the
25: .B -lfl
26: library to produce an executable. When the executable is run,
27: it analyzes its input for occurrences
28: of the regular expressions. Whenever it finds one, it executes
29: the corresponding C code.
30: .SH SOME SIMPLE EXAMPLES
31: .LP
32: First some simple examples to get the flavor of how one uses
33: .I flex.
34: The following
35: .I flex
36: input specifies a scanner which whenever it encounters the string
37: "username" will replace it with the user's login name:
38: .nf
39:
40: %%
41: username printf( "%s", getlogin() );
42:
43: .fi
44: By default, any text not matched by a
45: .I flex
46: scanner
47: is copied to the output, so the net effect of this scanner is
48: to copy its input file to its output with each occurrence
49: of "username" expanded.
50: In this input, there is just one rule. "username" is the
51: .I pattern
52: and the "printf" is the
53: .I action.
54: The "%%" marks the beginning of the rules.
55: .LP
56: Here's another simple example:
57: .nf
58:
59: int num_lines = 0, num_chars = 0;
60:
61: %%
62: \\n ++num_lines; ++num_chars;
63: . ++num_chars;
64:
65: %%
66: main()
67: {
68: yylex();
69: printf( "# of lines = %d, # of chars = %d\\n",
70: num_lines, num_chars );
71: }
72:
73: .fi
74: This scanner counts the number of characters and the number
75: of lines in its input (it produces no output other than the
76: final report on the counts). The first line
77: declares two globals, "num_lines" and "num_chars", which are accessible
78: both inside
79: .B yylex()
80: and in the
81: .B main()
82: routine declared after the second "%%". There are two rules, one
83: which matches a newline ("\\n") and increments both the line count and
84: the character count, and one which matches any character other than
85: a newline (indicated by the "." regular expression).
86: .LP
87: A somewhat more complicated example:
88: .nf
89:
90: /* scanner for a toy Pascal-like language */
91:
92: %{
93: /* need this for the call to atof() below */
94: #include <math.h>
95: %}
96:
97: DIGIT [0-9]
98: ID [a-z][a-z0-9]*
99:
100: %%
101:
102: {DIGIT}+ {
103: printf( "An integer: %s (%d)\\n", yytext,
104: atoi( yytext ) );
105: }
106:
107: {DIGIT}+"."{DIGIT}* {
108: printf( "A float: %s (%d)\\n", yytext,
109: atof( yytext ) );
110: }
111:
112: if|then|begin|end|procedure|function {
113: printf( "A keyword: %s\\n", yytext );
114: }
115:
116: {ID} printf( "An identifier: %s\\n", yytext );
117:
118: "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext );
119:
120: "{"[^}\\n]*"}" /* eat up one-line comments */
121:
122: [ \\t\\n]+ /* eat up whitespace */
123:
124: . printf( "Unrecognized character: %s\\n", yytext );
125:
126: %%
127:
128: main( argc, argv )
129: int argc;
130: char **argv;
131: {
132: ++argv, --argc; /* skip over program name */
133: if ( argc > 0 )
134: yyin = fopen( argv[0], "r" );
135: else
136: yyin = stdin;
137:
138: yylex();
139: }
140:
141: .fi
142: This is the beginnings of a simple scanner for a language like
143: Pascal. It identifies different types of
144: .I tokens
145: and reports on what it has seen.
146: .LP
147: The details of this example will be explained in the following
148: sections.
149: .SH FORMAT OF THE INPUT FILE
150: The
151: .I flex
152: input file consists of three sections, separated by a line with just
153: .B %%
154: in it:
155: .nf
156:
157: definitions
158: %%
159: rules
160: %%
161: user code
162:
163: .fi
164: The
165: .I definitions
166: section contains declarations of simple
167: .I name
168: definitions to simplify the scanner specification, and declarations of
169: .I start conditions,
170: which are explained in a later section.
171: .LP
172: Name definitions have the form:
173: .nf
174:
175: name definition
176:
177: .fi
178: The "name" is a word beginning with a letter or an underscore ('_')
179: followed by zero or more letters, digits, '_', or '-' (dash).
180: The definition is taken to begin at the first non-white-space character
181: following the name and continuing to the end of the line.
182: The definition can subsequently be referred to using "{name}", which
183: will expand to "(definition)". For example,
184: .nf
185:
186: DIGIT [0-9]
187: ID [a-z][a-z0-9]*
188:
189: .fi
190: defines "DIGIT" to be a regular expression which matches a
191: single digit, and
192: "ID" to be a regular expression which matches a letter
193: followed by zero-or-more letters-or-digits.
194: A subsequent reference to
195: .nf
196:
197: {DIGIT}+"."{DIGIT}*
198:
199: .fi
200: is identical to
201: .nf
202:
203: ([0-9])+"."([0-9])*
204:
205: .fi
206: and matches one-or-more digits followed by a '.' followed
207: by zero-or-more digits.
208: .LP
209: The
210: .I rules
211: section of the
212: .I flex
213: input contains a series of rules of the form:
214: .nf
215:
216: pattern action
217:
218: .fi
219: where the pattern must be unindented and the action must begin
220: on the same line.
221: .LP
222: See below for a further description of patterns and actions.
223: .LP
224: Finally, the user code section is simply copied to
225: .B lex.yy.c
226: verbatim.
227: It is used for companion routines which call or are called
228: by the scanner. The presence of this section is optional;
229: if it is missing, the second
230: .B %%
231: in the input file may be skipped, too.
232: .LP
233: In the definitions and rules sections, any
234: .I indented
235: text or text enclosed in
236: .B %{
237: and
238: .B %}
239: is copied verbatim to the output (with the %{}'s removed).
240: The %{}'s must appear unindented on lines by themselves.
241: .LP
242: In the rules section,
243: any indented or %{} text appearing before the
244: first rule may be used to declare variables
245: which are local to the scanning routine and (after the declarations)
246: code which is to be executed whenever the scanning routine is entered.
247: Other indented or %{} text in the rule section is still copied to the output,
248: but its meaning is not well-defined and it may well cause compile-time
249: errors (this feature is present for
250: .I POSIX
251: compliance; see below for other such features).
252: .LP
253: In the definitions section, an unindented comment (i.e., a line
254: beginning with "/*") is also copied verbatim to the output up
255: to the next "*/". Also, any line in the definitions section
256: beginning with '#' is ignored, though this style of comment is
257: deprecated and may go away in the future.
258: .SH PATTERNS
259: The patterns in the input are written using an extended set of regular
260: expressions. These are:
261: .nf
262:
263: x match the character 'x'
264: . any character except newline
265: [xyz] a "character class"; in this case, the pattern
266: matches either an 'x', a 'y', or a 'z'
267: [abj-oZ] a "character class" with a range in it; matches
268: an 'a', a 'b', any letter from 'j' through 'o',
269: or a 'Z'
270: [^A-Z] a "negated character class", i.e., any character
271: but those in the class. In this case, any
272: character EXCEPT an uppercase letter.
273: [^A-Z\\n] any character EXCEPT an uppercase letter or
274: a newline
275: r* zero or more r's, where r is any regular expression
276: r+ one or more r's
277: r? zero or one r's (that is, "an optional r")
278: r{2,5} anywhere from two to five r's
279: r{2,} two or more r's
280: r{4} exactly 4 r's
281: {name} the expansion of the "name" definition
282: (see above)
283: "[xyz]\\"foo"
284: the literal string: [xyz]"foo
285: \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
286: then the ANSI-C interpretation of \\x.
287: Otherwise, a literal 'X' (used to escape
288: operators such as '*')
289: \\123 the character with octal value 123
290: \\x2a the character with hexadecimal value 2a
291: (r) match an r; parentheses are used to override
292: precedence (see below)
293:
294:
295: rs the regular expression r followed by the
296: regular expression s; called "concatenation"
297:
298:
299: r|s either an r or an s
300:
301:
302: r/s an r but only if it is followed by an s. The
303: s is not part of the matched text. This type
304: of pattern is called as "trailing context".
305: ^r an r, but only at the beginning of a line
306: r$ an r, but only at the end of a line. Equivalent
307: to "r/\\n".
308:
309:
310: <s>r an r, but only in start condition s (see
311: below for discussion of start conditions)
312: <s1,s2,s3>r
313: same, but in any of start conditions s1,
314: s2, or s3
315:
316:
317: <<EOF>> an end-of-file
318: <s1,s2><<EOF>>
319: an end-of-file when in start condition s1 or s2
320:
321: .fi
322: The regular expressions listed above are grouped according to
323: precedence, from highest precedence at the top to lowest at the bottom.
324: Those grouped together have equal precedence. For example,
325: .nf
326:
327: foo|bar*
328:
329: .fi
330: is the same as
331: .nf
332:
333: (foo)|(ba(r*))
334:
335: .fi
336: since the '*' operator has higher precedence than concatenation,
337: and concatenation higher than alternation ('|'). This pattern
338: therefore matches
339: .I either
340: the string "foo"
341: .I or
342: the string "ba" followed by zero-or-more r's.
343: To match "foo" or zero-or-more "bar"'s, use:
344: .nf
345:
346: foo|(bar)*
347:
348: .fi
349: and to match zero-or-more "foo"'s-or-"bar"'s:
350: .nf
351:
352: (foo|bar)*
353:
354: .fi
355: .LP
356: Some notes on patterns:
357: .IP -
358: A negated character class such as the example "[^A-Z]"
359: above
360: .I will match a newline
361: unless "\\n" (or an equivalent escape sequence) is one of the
362: characters explicitly present in the negated character class
363: (e.g., "[^A-Z\\n]"). This is unlike how many other regular
364: expression tools treat negated character classes, but unfortunately
365: the inconsistency is historically entrenched.
366: Matching newlines means that a pattern like [^"]* can match an entire
367: input (overflowing the scanner's input buffer) unless there's another
368: quote in the input.
369: .IP -
370: A rule can have at most one instance of trailing context (the '/' operator
371: or the '$' operator). The start condition, '^', and "<<EOF>>" patterns
372: can only occur at the beginning of a pattern, and, as well as with '/' and '$',
373: cannot be grouped inside parentheses. A '^' which does not occur at
374: the beginning of a rule or a '$' which does not occur at the end of
375: a rule loses its special properties and is treated as a normal character.
376: .IP
377: The following are illegal:
378: .nf
379:
380: foo/bar$
381: <sc1>foo<sc2>bar
382:
383: .fi
384: Note that the first of these, can be written "foo/bar\\n".
385: .IP
386: The following will result in '$' or '^' being treated as a normal character:
387: .nf
388:
389: foo|(bar$)
390: foo|^bar
391:
392: .fi
393: If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
394: could be used (the special '|' action is explained below):
395: .nf
396:
397: foo |
398: bar$ /* action goes here */
399:
400: .fi
401: A similar trick will work for matching a foo or a
402: bar-at-the-beginning-of-a-line.
403: .SH HOW THE INPUT IS MATCHED
404: When the generated scanner is run, it analyzes its input looking
405: for strings which match any of its patterns. If it finds more than
406: one match, it takes the one matching the most text (for trailing
407: context rules, this includes the length of the trailing part, even
408: though it will then be returned to the input). If it finds two
409: or more matches of the same length, the
410: rule listed first in the
411: .I flex
412: input file is chosen.
413: .LP
414: Once the match is determined, the text corresponding to the match
415: (called the
416: .I token)
417: is made available in the global character pointer
418: .B yytext,
419: and its length in the global integer
420: .B yyleng.
421: The
422: .I action
423: corresponding to the matched pattern is then executed (a more
424: detailed description of actions follows), and then the remaining
425: input is scanned for another match.
426: .LP
427: If no match is found, then the
428: .I default rule
429: is executed: the next character in the input is considered matched and
430: copied to the standard output. Thus, the simplest legal
431: .I flex
432: input is:
433: .nf
434:
435: %%
436:
437: .fi
438: which generates a scanner that simply copies its input (one character
439: at a time) to its output.
440: .SH ACTIONS
441: Each pattern in a rule has a corresponding action, which can be any
442: arbitrary C statement. The pattern ends at the first non-escaped
443: whitespace character; the remainder of the line is its action. If the
444: action is empty, then when the pattern is matched the input token
445: is simply discarded. For example, here is the specification for a program
446: which deletes all occurrences of "zap me" from its input:
447: .nf
448:
449: %%
450: "zap me"
451:
452: .fi
453: (It will copy all other characters in the input to the output since
454: they will be matched by the default rule.)
455: .LP
456: Here is a program which compresses multiple blanks and tabs down to
457: a single blank, and throws away whitespace found at the end of a line:
458: .nf
459:
460: %%
461: [ \\t]+ putchar( ' ' );
462: [ \\t]+$ /* ignore this token */
463:
464: .fi
465: .LP
466: If the action contains a '{', then the action spans till the balancing '}'
467: is found, and the action may cross multiple lines.
468: .I flex
469: knows about C strings and comments and won't be fooled by braces found
470: within them, but also allows actions to begin with
471: .B %{
472: and will consider the action to be all the text up to the next
473: .B %}
474: (regardless of ordinary braces inside the action).
475: .LP
476: An action consisting solely of a vertical bar ('|') means "same as
477: the action for the next rule." See below for an illustration.
478: .LP
479: Actions can include arbitrary C code, including
480: .B return
481: statements to return a value to whatever routine called
482: .B yylex().
483: Each time
484: .B yylex()
485: is called it continues processing tokens from where it last left
486: off until it either reaches
487: the end of the file or executes a return. Once it reaches an end-of-file,
488: however, then any subsequent call to
489: .B yylex()
490: will simply immediately return, unless
491: .B yyrestart()
492: is first called (see below).
493: .LP
494: Actions are not allowed to modify yytext or yyleng.
495: .LP
496: There are a number of special directives which can be included within
497: an action:
498: .IP -
499: .B ECHO
500: copies yytext to the scanner's output.
501: .IP -
502: .B BEGIN
503: followed by the name of a start condition places the scanner in the
504: corresponding start condition (see below).
505: .IP -
506: .B REJECT
507: directs the scanner to proceed on to the "second best" rule which matched the
508: input (or a prefix of the input). The rule is chosen as described
509: above in "How the Input is Matched", and
510: .B yytext
511: and
512: .B yyleng
513: set up appropriately.
514: It may either be one which matched as much text
515: as the originally chosen rule but came later in the
516: .I flex
517: input file, or one which matched less text.
518: For example, the following will both count the
519: words in the input and call the routine special() whenever "frob" is seen:
520: .nf
521:
522: int word_count = 0;
523: %%
524:
525: frob special(); REJECT;
526: [^ \\t\\n]+ ++word_count;
527:
528: .fi
529: Without the
530: .B REJECT,
531: any "frob"'s in the input would not be counted as words, since the
532: scanner normally executes only one action per token.
533: Multiple
534: .B REJECT's
535: are allowed, each one finding the next best choice to the currently
536: active rule. For example, when the following scanner scans the token
537: "abcd", it will write "abcdabcaba" to the output:
538: .nf
539:
540: %%
541: a |
542: ab |
543: abc |
544: abcd ECHO; REJECT;
545: .|\\n /* eat up any unmatched character */
546:
547: .fi
548: (The first three rules share the fourth's action since they use
549: the special '|' action.)
550: .B REJECT
551: is a particularly expensive feature in terms scanner performance;
552: if it is used in
553: .I any
554: of the scanner's actions it will slow down
555: .I all
556: of the scanner's matching. Furthermore,
557: .B REJECT
558: cannot be used with the
559: .I -f
560: or
561: .I -F
562: options (see below).
563: .IP
564: Note also that unlike the other special actions,
565: .B REJECT
566: is a
567: .I branch;
568: code immediately following it in the action will
569: .I not
570: be executed.
571: .IP -
572: .B yymore()
573: tells the scanner that the next time it matches a rule, the corresponding
574: token should be
575: .I appended
576: onto the current value of
577: .B yytext
578: rather than replacing it. For example, given the input "mega-kludge"
579: the following will write "mega-mega-kludge" to the output:
580: .nf
581:
582: %%
583: mega- ECHO; yymore();
584: kludge ECHO;
585:
586: .fi
587: First "mega-" is matched and echoed to the output. Then "kludge"
588: is matched, but the previous "mega-" is still hanging around at the
589: beginning of
590: .B yytext
591: so the
592: .B ECHO
593: for the "kludge" rule will actually write "mega-kludge".
594: The presence of
595: .B yymore()
596: in the scanner's action entails a minor performance penalty in the
597: scanner's matching speed.
598: .IP -
599: .B yyless(n)
600: returns all but the first
601: .I n
602: characters of the current token back to the input stream, where they
603: will be rescanned when the scanner looks for the next match.
604: .B yytext
605: and
606: .B yyleng
607: are adjusted appropriately (e.g.,
608: .B yyleng
609: will now be equal to
610: .I n
611: ). For example, on the input "foobar" the following will write out
612: "foobarbar":
613: .nf
614:
615: %%
616: foobar ECHO; yyless(3);
617: [a-z]+ ECHO;
618:
619: .fi
620: An argument of 0 to
621: .B yyless
622: will cause the entire current input string to be scanned again. Unless you've
623: changed how the scanner will subsequently process its input (using
624: .B BEGIN,
625: for example), this will result in an endless loop.
626: .IP -
627: .B unput(c)
628: puts the character
629: .I c
630: back onto the input stream. It will be the next character scanned.
631: The following action will take the current token and cause it
632: to be rescanned enclosed in parentheses.
633: .nf
634:
635: {
636: int i;
637: unput( ')' );
638: for ( i = yyleng - 1; i >= 0; --i )
639: unput( yytext[i] );
640: unput( '(' );
641: }
642:
643: .fi
644: Note that since each
645: .B unput()
646: puts the given character back at the
647: .I beginning
648: of the input stream, pushing back strings must be done back-to-front.
649: .IP -
650: .B input()
651: reads the next character from the input stream. For example,
652: the following is one way to eat up C comments:
653: .nf
654:
655: %%
656: "/*" {
657: register int c;
658:
659: for ( ; ; )
660: {
661: while ( (c = input()) != '*' &&
662: c != EOF )
663: ; /* eat up text of comment */
664:
665: if ( c == '*' )
666: {
667: while ( (c = input()) == '*' )
668: ;
669: if ( c == '/' )
670: break; /* found the end */
671: }
672:
673: if ( c == EOF )
674: {
675: error( "EOF in comment" );
676: break;
677: }
678: }
679: }
680:
681: .fi
682: (Note that if the scanner is compiled using
683: .B C++,
684: then
685: .B input()
686: is instead referred to as
687: .B yyinput(),
688: in order to avoid a name clash with the
689: .B C++
690: stream by the name of
691: .I input.)
692: .IP -
693: .B yyterminate()
694: can be used in lieu of a return statement in an action. It terminates
695: the scanner and returns a 0 to the scanner's caller, indicating "all done".
696: Subsequent calls to the scanner will immediately return unless preceded
697: by a call to
698: .B yyrestart()
699: (see below).
700: By default,
701: .B yyterminate()
702: is also called when an end-of-file is encountered. It is a macro and
703: may be redefined.
704: .SH THE GENERATED SCANNER
705: The output of
706: .I flex
707: is the file
708: .B lex.yy.c,
709: which contains the scanning routine
710: .B yylex(),
711: a number of tables used by it for matching tokens, and a number
712: of auxiliary routines and macros. By default,
713: .B yylex()
714: is declared as follows:
715: .nf
716:
717: int yylex()
718: {
719: ... various definitions and the actions in here ...
720: }
721:
722: .fi
723: (If your environment supports function prototypes, then it will
724: be "int yylex( void )".) This definition may be changed by redefining
725: the "YY_DECL" macro. For example, you could use:
726: .nf
727:
728: #undef YY_DECL
729: #define YY_DECL float lexscan( a, b ) float a, b;
730:
731: .fi
732: to give the scanning routine the name
733: .I lexscan,
734: returning a float, and taking two floats as arguments. Note that
735: if you give arguments to the scanning routine using a
736: K&R-style/non-prototyped function declaration, you must terminate
737: the definition with a semi-colon (;).
738: .LP
739: Whenever
740: .B yylex()
741: is called, it scans tokens from the global input file
742: .I yyin
743: (which defaults to stdin). It continues until it either reaches
744: an end-of-file (at which point it returns the value 0) or
745: one of its actions executes a
746: .I return
747: statement.
748: In the former case, when called again the scanner will immediately
749: return unless
750: .B yyrestart()
751: is called to point
752: .I yyin
753: at the new input file. (
754: .B yyrestart()
755: takes one argument, a
756: .B FILE *
757: pointer.)
758: In the latter case (i.e., when an action
759: executes a return), the scanner may then be called again and it
760: will resume scanning where it left off.
761: .LP
762: By default (and for purposes of efficiency), the scanner uses
763: block-reads rather than simple
764: .I getc()
765: calls to read characters from
766: .I yyin.
767: The nature of how it gets its input can be controlled by redefining the
768: .B YY_INPUT
769: macro.
770: YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
771: action is to place up to
772: .I max_size
773: characters in the character array
774: .I buf
775: and return in the integer variable
776: .I result
777: either the
778: number of characters read or the constant YY_NULL (0 on Unix systems)
779: to indicate EOF. The default YY_INPUT reads from the
780: global file-pointer "yyin".
781: .LP
782: A sample redefinition of YY_INPUT (in the definitions
783: section of the input file):
784: .nf
785:
786: %{
787: #undef YY_INPUT
788: #define YY_INPUT(buf,result,max_size) \\
789: result = ((buf[0] = getchar()) == EOF) ? YY_NULL : 1;
790: %}
791:
792: .fi
793: This definition will change the input processing to occur
794: one character at a time.
795: .LP
796: You also can add in things like keeping track of the
797: input line number this way; but don't expect your scanner to
798: go very fast.
799: .LP
800: When the scanner receives an end-of-file indication from YY_INPUT,
801: it then checks the
802: .B yywrap()
803: function. If
804: .B yywrap()
805: returns false (zero), then it is assumed that the
806: function has gone ahead and set up
807: .I yyin
808: to point to another input file, and scanning continues. If it returns
809: true (non-zero), then the scanner terminates, returning 0 to its
810: caller.
811: .LP
812: The default
813: .B yywrap()
814: always returns 1. Presently, to redefine it you must first
815: "#undef yywrap", as it is currently implemented as a macro. As indicated
816: by the hedging in the previous sentence, it may be changed to
817: a true function in the near future.
818: .LP
819: The scanner writes its
820: .B ECHO
821: output to the
822: .I yyout
823: global (default, stdout), which may be redefined by the user simply
824: by assigning it to some other
825: .B FILE
826: pointer.
827: .SH START CONDITIONS
828: .I flex
829: provides a mechanism for conditionally activating rules. Any rule
830: whose pattern is prefixed with "<sc>" will only be active when
831: the scanner is in the start condition named "sc". For example,
832: .nf
833:
834: <STRING>[^"]* { /* eat up the string body ... */
835: ...
836: }
837:
838: .fi
839: will be active only when the scanner is in the "STRING" start
840: condition, and
841: .nf
842:
843: <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */
844: ...
845: }
846:
847: .fi
848: will be active only when the current start condition is
849: either "INITIAL", "STRING", or "QUOTE".
850: .LP
851: Start conditions
852: are declared in the definitions (first) section of the input
853: using unindented lines beginning with either
854: .B %s
855: or
856: .B %x
857: followed by a list of names.
858: The former declares
859: .I inclusive
860: start conditions, the latter
861: .I exclusive
862: start conditions. A start condition is activated using the
863: .B BEGIN
864: action. Until the next
865: .B BEGIN
866: action is executed, rules with the given start
867: condition will be active and
868: rules with other start conditions will be inactive.
869: If the start condition is
870: .I inclusive,
871: then rules with no start conditions at all will also be active.
872: If it is
873: .I exclusive,
874: then
875: .I only
876: rules qualified with the start condition will be active.
877: A set of rules contingent on the same exclusive start condition
878: describe a scanner which is independent of any of the other rules in the
879: .I flex
880: input. Because of this,
881: exclusive start conditions make it easy to specify "mini-scanners"
882: which scan portions of the input that are syntactically different
883: from the rest (e.g., comments).
884: .LP
885: If the distinction between inclusive and exclusive start conditions
886: is still a little vague, here's a simple example illustrating the
887: connection between the two. The set of rules:
888: .nf
889:
890: %s example
891: %%
892: <example>foo /* do something */
893:
894: .fi
895: is equivalent to
896: .nf
897:
898: %x example
899: %%
900: <INITIAL,example>foo /* do something */
901:
902: .fi
903: .LP
904: The default rule (to
905: .B ECHO
906: any unmatched character) remains active in start conditions.
907: .LP
908: .B BEGIN(0)
909: returns to the original state where only the rules with
910: no start conditions are active. This state can also be
911: referred to as the start-condition "INITIAL", so
912: .B BEGIN(INITIAL)
913: is equivalent to
914: .B BEGIN(0).
915: (The parentheses around the start condition name are not required but
916: are considered good style.)
917: .LP
918: .B BEGIN
919: actions can also be given as indented code at the beginning
920: of the rules section. For example, the following will cause
921: the scanner to enter the "SPECIAL" start condition whenever
922: .I yylex()
923: is called and the global variable
924: .I enter_special
925: is true:
926: .nf
927:
928: int enter_special;
929:
930: %x SPECIAL
931: %%
932: if ( enter_special )
933: BEGIN(SPECIAL);
934:
935: <SPECIAL>blahblahblah
936: ...more rules follow...
937:
938: .fi
939: .LP
940: To illustrate the uses of start conditions,
941: here is a scanner which provides two different interpretations
942: of a string like "123.456". By default it will treat it as
943: as three tokens, the integer "123", a dot ('.'), and the integer "456".
944: But if the string is preceded earlier in the line by the string
945: "expect-floats"
946: it will treat it as a single token, the floating-point number
947: 123.456:
948: .nf
949:
950: %{
951: #include <math.h>
952: %}
953: %s expect
954:
955: %%
956: expect-floats BEGIN(expect);
957:
958: <expect>[0-9]+"."[0-9]+ {
959: printf( "found a float, = %f\\n",
960: atof( yytext ) );
961: }
962: <expect>\\n {
963: /* that's the end of the line, so
964: * we need another "expect-number"
965: * before we'll recognize any more
966: * numbers
967: */
968: BEGIN(INITIAL);
969: }
970:
971: [0-9]+ {
972: printf( "found an integer, = %d\\n",
973: atoi( yytext ) );
974: }
975:
976: "." printf( "found a dot\\n" );
977:
978: .fi
979: Here is a scanner which recognizes (and discards) C comments while
980: maintaining a count of the current input line.
981: .nf
982:
983: %x comment
984: %%
985: int line_num = 1;
986:
987: "/*" BEGIN(comment);
988:
989: <comment>[^*\\n]* /* eat anything that's not a '*' */
990: <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
991: <comment>\\n ++line_num;
992: <comment>"*"+"/" BEGIN(INITIAL);
993:
994: .fi
995: Note that start-conditions names are really integer values and
996: can be stored as such. Thus, the above could be extended in the
997: following fashion:
998: .nf
999:
1000: %x comment foo
1001: %%
1002: int line_num = 1;
1003: int comment_caller;
1004:
1005: "/*" {
1006: comment_caller = INITIAL;
1007: BEGIN(comment);
1008: }
1009:
1010: ...
1011:
1012: <foo>"/*" {
1013: comment_caller = foo;
1014: BEGIN(comment);
1015: }
1016:
1017: <comment>[^*\\n]* /* eat anything that's not a '*' */
1018: <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
1019: <comment>\\n ++line_num;
1020: <comment>"*"+"/" BEGIN(comment_caller);
1021:
1022: .fi
1023: One can then implement a "stack" of start conditions using an
1024: array of integers. (It is likely that such stacks will become
1025: a full-fledged
1026: .I flex
1027: feature in the future.) Note, though, that
1028: start conditions do not have their own name-space; %s's and %x's
1029: declare names in the same fashion as #define's.
1030: .SH MULTIPLE INPUT BUFFERS
1031: Some scanners (such as those which support "include" files)
1032: require reading from several input streams. As
1033: .I flex
1034: scanners do a large amount of buffering, one cannot control
1035: where the next input will be read from by simply writing a
1036: .B YY_INPUT
1037: which is sensitive to the scanning context.
1038: .B YY_INPUT
1039: is only called when the scanner reaches the end of its buffer, which
1040: may be a long time after scanning a statement such as an "include"
1041: which requires switching the input source.
1042: .LP
1043: To negotiate these sorts of problems,
1044: .I flex
1045: provides a mechanism for creating and switching between multiple
1046: input buffers. An input buffer is created by using:
1047: .nf
1048:
1049: YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
1050:
1051: .fi
1052: which takes a
1053: .I FILE
1054: pointer and a size and creates a buffer associated with the given
1055: file and large enough to hold
1056: .I size
1057: characters (when in doubt, use
1058: .B YY_BUF_SIZE
1059: for the size). It returns a
1060: .B YY_BUFFER_STATE
1061: handle, which may then be passed to other routines:
1062: .nf
1063:
1064: void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
1065:
1066: .fi
1067: switches the scanner's input buffer so subsequent tokens will
1068: come from
1069: .I new_buffer.
1070: Note that
1071: .B yy_switch_to_buffer()
1072: may be used by yywrap() to sets things up for continued scanning, instead
1073: of opening a new file and pointing
1074: .I yyin
1075: at it.
1076: .nf
1077:
1078: void yy_delete_buffer( YY_BUFFER_STATE buffer )
1079:
1080: .fi
1081: is used to reclaim the storage associated with a buffer.
1082: .LP
1083: .B yy_new_buffer()
1084: is an alias for
1085: .B yy_create_buffer(),
1086: provided for compatibility with the C++ use of
1087: .I new
1088: and
1089: .I delete
1090: for creating and destroying dynamic objects.
1091: .LP
1092: Finally, the
1093: .B YY_CURRENT_BUFFER
1094: macro returns a
1095: .B YY_BUFFER_STATE
1096: handle to the current buffer.
1097: .LP
1098: Here is an example of using these features for writing a scanner
1099: which expands include files (the
1100: .B <<EOF>>
1101: feature is discussed below):
1102: .nf
1103:
1104: /* the "incl" state is used for picking up the name
1105: * of an include file
1106: */
1107: %x incl
1108:
1109: %{
1110: #define MAX_INCLUDE_DEPTH 10
1111: YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1112: int include_stack_ptr = 0;
1113: %}
1114:
1115: %%
1116: include BEGIN(incl);
1117:
1118: [a-z]+ ECHO;
1119: [^a-z\\n]*\\n? ECHO;
1120:
1121: <incl>[ \\t]* /* eat the whitespace */
1122: <incl>[^ \\t\\n]+ { /* got the include file name */
1123: if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
1124: {
1125: fprintf( stderr, "Includes nested too deeply" );
1126: exit( 1 );
1127: }
1128:
1129: include_stack[include_stack_ptr++] =
1130: YY_CURRENT_BUFFER;
1131:
1132: yyin = fopen( yytext, "r" );
1133:
1134: if ( ! yyin )
1135: error( ... );
1136:
1137: yy_switch_to_buffer(
1138: yy_create_buffer( yyin, YY_BUF_SIZE ) );
1139:
1140: BEGIN(INITIAL);
1141: }
1142:
1143: <<EOF>> {
1144: if ( --include_stack_ptr < 0 )
1145: {
1146: yyterminate();
1147: }
1148:
1149: else
1150: yy_switch_to_buffer(
1151: include_stack[include_stack_ptr] );
1152: }
1153:
1154: .fi
1155: .SH END-OF-FILE RULES
1156: The special rule "<<EOF>>" indicates
1157: actions which are to be taken when an end-of-file is
1158: encountered and yywrap() returns non-zero (i.e., indicates
1159: no further files to process). The action must finish
1160: by doing one of four things:
1161: .IP -
1162: the special
1163: .B YY_NEW_FILE
1164: action, if
1165: .I yyin
1166: has been pointed at a new file to process;
1167: .IP -
1168: a
1169: .I return
1170: statement;
1171: .IP -
1172: the special
1173: .B yyterminate()
1174: action;
1175: .IP -
1176: or, switching to a new buffer using
1177: .B yy_switch_to_buffer()
1178: as shown in the example above.
1179: .LP
1180: <<EOF>> rules may not be used with other
1181: patterns; they may only be qualified with a list of start
1182: conditions. If an unqualified <<EOF>> rule is given, it
1183: applies to
1184: .I all
1185: start conditions which do not already have <<EOF>> actions. To
1186: specify an <<EOF>> rule for only the initial start condition, use
1187: .nf
1188:
1189: <INITIAL><<EOF>>
1190:
1191: .fi
1192: .LP
1193: These rules are useful for catching things like unclosed comments.
1194: An example:
1195: .nf
1196:
1197: %x quote
1198: %%
1199:
1200: ...other rules for dealing with quotes...
1201:
1202: <quote><<EOF>> {
1203: error( "unterminated quote" );
1204: yyterminate();
1205: }
1206: <<EOF>> {
1207: if ( *++filelist )
1208: {
1209: yyin = fopen( *filelist, "r" );
1210: YY_NEW_FILE;
1211: }
1212: else
1213: yyterminate();
1214: }
1215:
1216: .fi
1217: .SH MISCELLANEOUS MACROS
1218: The macro
1219: .bd
1220: YY_USER_ACTION
1221: can be redefined to provide an action
1222: which is always executed prior to the matched rule's action. For example,
1223: it could be #define'd to call a routine to convert yytext to lower-case.
1224: .LP
1225: The macro
1226: .B YY_USER_INIT
1227: may be redefined to provide an action which is always executed before
1228: the first scan (and before the scanner's internal initializations are done).
1229: For example, it could be used to call a routine to read
1230: in a data table or open a logging file.
1231: .LP
1232: In the generated scanner, the actions are all gathered in one large
1233: switch statement and separated using
1234: .B YY_BREAK,
1235: which may be redefined. By default, it is simply a "break", to separate
1236: each rule's action from the following rule's.
1237: Redefining
1238: .B YY_BREAK
1239: allows, for example, C++ users to
1240: #define YY_BREAK to do nothing (while being very careful that every
1241: rule ends with a "break" or a "return"!) to avoid suffering from
1242: unreachable statement warnings where because a rule's action ends with
1243: "return", the
1244: .B YY_BREAK
1245: is inaccessible.
1246: .SH INTERFACING WITH YACC
1247: One of the main uses of
1248: .I flex
1249: is as a companion to the
1250: .I yacc
1251: parser-generator.
1252: .I yacc
1253: parsers expect to call a routine named
1254: .B yylex()
1255: to find the next input token. The routine is supposed to
1256: return the type of the next token as well as putting any associated
1257: value in the global
1258: .B yylval.
1259: To use
1260: .I flex
1261: with
1262: .I yacc,
1263: one specifies the
1264: .B -d
1265: option to
1266: .I yacc
1267: to instruct it to generate the file
1268: .B y.tab.h
1269: containing definitions of all the
1270: .B %tokens
1271: appearing in the
1272: .I yacc
1273: input. This file is then included in the
1274: .I flex
1275: scanner. For example, if one of the tokens is "TOK_NUMBER",
1276: part of the scanner might look like:
1277: .nf
1278:
1279: %{
1280: #include "y.tab.h"
1281: %}
1282:
1283: %%
1284:
1285: [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
1286:
1287: .fi
1288: .SH TRANSLATION TABLE
1289: In the name of POSIX compliance,
1290: .I flex
1291: supports a
1292: .I translation table
1293: for mapping input characters into groups.
1294: The table is specified in the first section, and its format looks like:
1295: .nf
1296:
1297: %t
1298: 1 abcd
1299: 2 ABCDEFGHIJKLMNOPQRSTUVWXYZ
1300: 52 0123456789
1301: 6 \\t\\ \\n
1302: %t
1303:
1304: .fi
1305: This example specifies that the characters 'a', 'b', 'c', and 'd'
1306: are to all be lumped into group #1, upper-case letters
1307: in group #2, digits in group #52, tabs, blanks, and newlines into
1308: group #6, and
1309: .I
1310: no other characters will appear in the patterns.
1311: The group numbers are actually disregarded by
1312: .I flex;
1313: .B %t
1314: serves, though, to lump characters together. Given the above
1315: table, for example, the pattern "a(AA)*5" is equivalent to "d(ZQ)*0".
1316: They both say, "match any character in group #1, followed by
1317: zero-or-more pairs of characters
1318: from group #2, followed by a character from group #52." Thus
1319: .B %t
1320: provides a crude way for introducing equivalence classes into
1321: the scanner specification.
1322: .LP
1323: Note that the
1324: .B -i
1325: option (see below) coupled with the equivalence classes which
1326: .I flex
1327: automatically generates take care of virtually all the instances
1328: when one might consider using
1329: .B %t.
1330: But what the hell, it's there if you want it.
1331: .SH OPTIONS
1332: .I flex
1333: has the following options:
1334: .TP
1335: .B -b
1336: Generate backtracking information to
1337: .I lex.backtrack.
1338: This is a list of scanner states which require backtracking
1339: and the input characters on which they do so. By adding rules one
1340: can remove backtracking states. If all backtracking states
1341: are eliminated and
1342: .B -f
1343: or
1344: .B -F
1345: is used, the generated scanner will run faster (see the
1346: .B -p
1347: flag). Only users who wish to squeeze every last cycle out of their
1348: scanners need worry about this option. (See the section on PERFORMANCE
1349: CONSIDERATIONS below.)
1350: .TP
1351: .B -c
1352: is a do-nothing, deprecated option included for POSIX compliance.
1353: .IP
1354: .B NOTE:
1355: in previous releases of
1356: .I flex
1357: .B -c
1358: specified table-compression options. This functionality is
1359: now given by the
1360: .B -C
1361: flag. To ease the the impact of this change, when
1362: .I flex
1363: encounters
1364: .B -c,
1365: it currently issues a warning message and assumes that
1366: .B -C
1367: was desired instead. In the future this "promotion" of
1368: .B -c
1369: to
1370: .B -C
1371: will go away in the name of full POSIX compliance (unless
1372: the POSIX meaning is removed first).
1373: .TP
1374: .B -d
1375: makes the generated scanner run in
1376: .I debug
1377: mode. Whenever a pattern is recognized and the global
1378: .B yy_flex_debug
1379: is non-zero (which is the default),
1380: the scanner will write to
1381: .I stderr
1382: a line of the form:
1383: .nf
1384:
1385: --accepting rule at line 53 ("the matched text")
1386:
1387: .fi
1388: The line number refers to the location of the rule in the file
1389: defining the scanner (i.e., the file that was fed to flex). Messages
1390: are also generated when the scanner backtracks, accepts the
1391: default rule, reaches the end of its input buffer (or encounters
1392: a NUL; at this point, the two look the same as far as the scanner's concerned),
1393: or reaches an end-of-file.
1394: .TP
1395: .B -f
1396: specifies (take your pick)
1397: .I full table
1398: or
1399: .I fast scanner.
1400: No table compression is done. The result is large but fast.
1401: This option is equivalent to
1402: .B -Cf
1403: (see below).
1404: .TP
1405: .B -i
1406: instructs
1407: .I flex
1408: to generate a
1409: .I case-insensitive
1410: scanner. The case of letters given in the
1411: .I flex
1412: input patterns will
1413: be ignored, and tokens in the input will be matched regardless of case. The
1414: matched text given in
1415: .I yytext
1416: will have the preserved case (i.e., it will not be folded).
1417: .TP
1418: .B -n
1419: is another do-nothing, deprecated option included only for
1420: POSIX compliance.
1421: .TP
1422: .B -p
1423: generates a performance report to stderr. The report
1424: consists of comments regarding features of the
1425: .I flex
1426: input file which will cause a loss of performance in the resulting scanner.
1427: Note that the use of
1428: .I REJECT
1429: and variable trailing context (see the BUGS section in flex(1))
1430: entails a substantial performance penalty; use of
1431: .I yymore(),
1432: the
1433: .B ^
1434: operator,
1435: and the
1436: .B -I
1437: flag entail minor performance penalties.
1438: .TP
1439: .B -s
1440: causes the
1441: .I default rule
1442: (that unmatched scanner input is echoed to
1443: .I stdout)
1444: to be suppressed. If the scanner encounters input that does not
1445: match any of its rules, it aborts with an error. This option is
1446: useful for finding holes in a scanner's rule set.
1447: .TP
1448: .B -t
1449: instructs
1450: .I flex
1451: to write the scanner it generates to standard output instead
1452: of
1453: .B lex.yy.c.
1454: .TP
1455: .B -v
1456: specifies that
1457: .I flex
1458: should write to
1459: .I stderr
1460: a summary of statistics regarding the scanner it generates.
1461: Most of the statistics are meaningless to the casual
1462: .I flex
1463: user, but the
1464: first line identifies the version of
1465: .I flex,
1466: which is useful for figuring
1467: out where you stand with respect to patches and new releases,
1468: and the next two lines give the date when the scanner was created
1469: and a summary of the flags which were in effect.
1470: .TP
1471: .B -F
1472: specifies that the
1473: .ul
1474: fast
1475: scanner table representation should be used. This representation is
1476: about as fast as the full table representation
1477: .ul
1478: (-f),
1479: and for some sets of patterns will be considerably smaller (and for
1480: others, larger). In general, if the pattern set contains both "keywords"
1481: and a catch-all, "identifier" rule, such as in the set:
1482: .nf
1483:
1484: "case" return TOK_CASE;
1485: "switch" return TOK_SWITCH;
1486: ...
1487: "default" return TOK_DEFAULT;
1488: [a-z]+ return TOK_ID;
1489:
1490: .fi
1491: then you're better off using the full table representation. If only
1492: the "identifier" rule is present and you then use a hash table or some such
1493: to detect the keywords, you're better off using
1494: .ul
1495: -F.
1496: .IP
1497: This option is equivalent to
1498: .B -CF
1499: (see below).
1500: .TP
1501: .B -I
1502: instructs
1503: .I flex
1504: to generate an
1505: .I interactive
1506: scanner. Normally, scanners generated by
1507: .I flex
1508: always look ahead one
1509: character before deciding that a rule has been matched. At the cost of
1510: some scanning overhead,
1511: .I flex
1512: will generate a scanner which only looks ahead
1513: when needed. Such scanners are called
1514: .I interactive
1515: because if you want to write a scanner for an interactive system such as a
1516: command shell, you will probably want the user's input to be terminated
1517: with a newline, and without
1518: .B -I
1519: the user will have to type a character in addition to the newline in order
1520: to have the newline recognized. This leads to dreadful interactive
1521: performance.
1522: .IP
1523: If all this seems to confusing, here's the general rule: if a human will
1524: be typing in input to your scanner, use
1525: .B -I,
1526: otherwise don't; if you don't care about squeezing the utmost performance
1527: from your scanner and you
1528: don't want to make any assumptions about the input to your scanner,
1529: use
1530: .B -I.
1531: .IP
1532: Note,
1533: .B -I
1534: cannot be used in conjunction with
1535: .I full
1536: or
1537: .I fast tables,
1538: i.e., the
1539: .B -f, -F, -Cf,
1540: or
1541: .B -CF
1542: flags.
1543: .TP
1544: .B -L
1545: instructs
1546: .I flex
1547: not to generate
1548: .B #line
1549: directives. Without this option,
1550: .I flex
1551: peppers the generated scanner
1552: with #line directives so error messages in the actions will be correctly
1553: located with respect to the original
1554: .I flex
1555: input file, and not to
1556: the fairly meaningless line numbers of
1557: .B lex.yy.c.
1558: (Unfortunately
1559: .I flex
1560: does not presently generate the necessary directives
1561: to "retarget" the line numbers for those parts of
1562: .B lex.yy.c
1563: which it generated. So if there is an error in the generated code,
1564: a meaningless line number is reported.)
1565: .TP
1566: .B -T
1567: makes
1568: .I flex
1569: run in
1570: .I trace
1571: mode. It will generate a lot of messages to
1572: .I stdout
1573: concerning
1574: the form of the input and the resultant non-deterministic and deterministic
1575: finite automata. This option is mostly for use in maintaining
1576: .I flex.
1577: .TP
1578: .B -8
1579: instructs
1580: .I flex
1581: to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1582: characters. On some sites,
1583: .I flex
1584: is installed with this option as the default. On others, the default
1585: is 7-bit characters. To see which is the case, check the verbose
1586: .B (-v)
1587: output for "equivalence classes created". If the denominator of
1588: the number shown is 128, then by default
1589: .I flex
1590: is generating 7-bit characters. If it is 256, then the default is
1591: 8-bit characters and the
1592: .B -8
1593: flag is not required (but may be a good idea to keep the scanner
1594: specification portable). Feeding a 7-bit scanner 8-bit characters
1595: will result in infinite loops, bus errors, or other such fireworks,
1596: so when in doubt, use the flag. Note that if equivalence classes
1597: are used, 8-bit scanners take only slightly more table space than
1598: 7-bit scanners (128 bytes, to be exact); if equivalence classes are
1599: not used, however, then the tables may grow up to twice their
1600: 7-bit size.
1601: .TP
1602: .B -C[efmF]
1603: controls the degree of table compression.
1604: .IP
1605: .B -Ce
1606: directs
1607: .I flex
1608: to construct
1609: .I equivalence classes,
1610: i.e., sets of characters
1611: which have identical lexical properties (for example, if the only
1612: appearance of digits in the
1613: .I flex
1614: input is in the character class
1615: "[0-9]" then the digits '0', '1', ..., '9' will all be put
1616: in the same equivalence class). Equivalence classes usually give
1617: dramatic reductions in the final table/object file sizes (typically
1618: a factor of 2-5) and are pretty cheap performance-wise (one array
1619: look-up per character scanned).
1620: .IP
1621: .B -Cf
1622: specifies that the
1623: .I full
1624: scanner tables should be generated -
1625: .I flex
1626: should not compress the
1627: tables by taking advantages of similar transition functions for
1628: different states.
1629: .IP
1630: .B -CF
1631: specifies that the alternate fast scanner representation (described
1632: above under the
1633: .B -F
1634: flag)
1635: should be used.
1636: .IP
1637: .B -Cm
1638: directs
1639: .I flex
1640: to construct
1641: .I meta-equivalence classes,
1642: which are sets of equivalence classes (or characters, if equivalence
1643: classes are not being used) that are commonly used together. Meta-equivalence
1644: classes are often a big win when using compressed tables, but they
1645: have a moderate performance impact (one or two "if" tests and one
1646: array look-up per character scanned).
1647: .IP
1648: A lone
1649: .B -C
1650: specifies that the scanner tables should be compressed but neither
1651: equivalence classes nor meta-equivalence classes should be used.
1652: .IP
1653: The options
1654: .B -Cf
1655: or
1656: .B -CF
1657: and
1658: .B -Cm
1659: do not make sense together - there is no opportunity for meta-equivalence
1660: classes if the table is not being compressed. Otherwise the options
1661: may be freely mixed.
1662: .IP
1663: The default setting is
1664: .B -Cem,
1665: which specifies that
1666: .I flex
1667: should generate equivalence classes
1668: and meta-equivalence classes. This setting provides the highest
1669: degree of table compression. You can trade off
1670: faster-executing scanners at the cost of larger tables with
1671: the following generally being true:
1672: .nf
1673:
1674: slowest & smallest
1675: -Cem
1676: -Cm
1677: -Ce
1678: -C
1679: -C{f,F}e
1680: -C{f,F}
1681: fastest & largest
1682:
1683: .fi
1684: Note that scanners with the smallest tables are usually generated and
1685: compiled the quickest, so
1686: during development you will usually want to use the default, maximal
1687: compression.
1688: .IP
1689: .B -Cfe
1690: is often a good compromise between speed and size for production
1691: scanners.
1692: .IP
1693: .B -C
1694: options are not cumulative; whenever the flag is encountered, the
1695: previous -C settings are forgotten.
1696: .TP
1697: .B -Sskeleton_file
1698: overrides the default skeleton file from which
1699: .I flex
1700: constructs its scanners. You'll never need this option unless you are doing
1701: .I flex
1702: maintenance or development.
1703: .SH PERFORMANCE CONSIDERATIONS
1704: The main design goal of
1705: .I flex
1706: is that it generate high-performance scanners. It has been optimized
1707: for dealing well with large sets of rules. Aside from the effects
1708: of table compression on scanner speed outlined above,
1709: there are a number of options/actions which degrade performance. These
1710: are, from most expensive to least:
1711: .nf
1712:
1713: REJECT
1714:
1715: pattern sets that require backtracking
1716: arbitrary trailing context
1717:
1718: '^' beginning-of-line operator
1719: yymore()
1720:
1721: .fi
1722: with the first three all being quite expensive and the last two
1723: being quite cheap.
1724: .LP
1725: .B REJECT
1726: should be avoided at all costs when performance is important.
1727: It is a particularly expensive option.
1728: .LP
1729: Getting rid of backtracking is messy and often may be an enormous
1730: amount of work for a complicated scanner. In principal, one begins
1731: by using the
1732: .B -b
1733: flag to generate a
1734: .I lex.backtrack
1735: file. For example, on the input
1736: .nf
1737:
1738: %%
1739: foo return TOK_KEYWORD;
1740: foobar return TOK_KEYWORD;
1741:
1742: .fi
1743: the file looks like:
1744: .nf
1745:
1746: State #6 is non-accepting -
1747: associated rule line numbers:
1748: 2 3
1749: out-transitions: [ o ]
1750: jam-transitions: EOF [ \\001-n p-\\177 ]
1751:
1752: State #8 is non-accepting -
1753: associated rule line numbers:
1754: 3
1755: out-transitions: [ a ]
1756: jam-transitions: EOF [ \\001-` b-\\177 ]
1757:
1758: State #9 is non-accepting -
1759: associated rule line numbers:
1760: 3
1761: out-transitions: [ r ]
1762: jam-transitions: EOF [ \\001-q s-\\177 ]
1763:
1764: Compressed tables always backtrack.
1765:
1766: .fi
1767: The first few lines tell us that there's a scanner state in
1768: which it can make a transition on an 'o' but not on any other
1769: character, and that in that state the currently scanned text does not match
1770: any rule. The state occurs when trying to match the rules found
1771: at lines 2 and 3 in the input file.
1772: If the scanner is in that state and then reads
1773: something other than an 'o', it will have to backtrack to find
1774: a rule which is matched. With
1775: a bit of headscratching one can see that this must be the
1776: state it's in when it has seen "fo". When this has happened,
1777: if anything other than another 'o' is seen, the scanner will
1778: have to back up to simply match the 'f' (by the default rule).
1779: .LP
1780: The comment regarding State #8 indicates there's a problem
1781: when "foob" has been scanned. Indeed, on any character other
1782: than a 'b', the scanner will have to back up to accept "foo".
1783: Similarly, the comment for State #9 concerns when "fooba" has
1784: been scanned.
1785: .LP
1786: The final comment reminds us that there's no point going to
1787: all the trouble of removing backtracking from the rules unless
1788: we're using
1789: .B -f
1790: or
1791: .B -F,
1792: since there's no performance gain doing so with compressed scanners.
1793: .LP
1794: The way to remove the backtracking is to add "error" rules:
1795: .nf
1796:
1797: %%
1798: foo return TOK_KEYWORD;
1799: foobar return TOK_KEYWORD;
1800:
1801: fooba |
1802: foob |
1803: fo {
1804: /* false alarm, not really a keyword */
1805: return TOK_ID;
1806: }
1807:
1808: .fi
1809: .LP
1810: Eliminating backtracking among a list of keywords can also be
1811: done using a "catch-all" rule:
1812: .nf
1813:
1814: %%
1815: foo return TOK_KEYWORD;
1816: foobar return TOK_KEYWORD;
1817:
1818: [a-z]+ return TOK_ID;
1819:
1820: .fi
1821: This is usually the best solution when appropriate.
1822: .LP
1823: Backtracking messages tend to cascade.
1824: With a complicated set of rules it's not uncommon to get hundreds
1825: of messages. If one can decipher them, though, it often
1826: only takes a dozen or so rules to eliminate the backtracking (though
1827: it's easy to make a mistake and have an error rule accidentally match
1828: a valid token. A possible future
1829: .I flex
1830: feature will be to automatically add rules to eliminate backtracking).
1831: .LP
1832: .I Variable
1833: trailing context (where both the leading and trailing parts do not have
1834: a fixed length) entails almost the same performance loss as
1835: .I REJECT
1836: (i.e., substantial). So when possible a rule like:
1837: .nf
1838:
1839: %%
1840: mouse|rat/(cat|dog) run();
1841:
1842: .fi
1843: is better written:
1844: .nf
1845:
1846: %%
1847: mouse/cat|dog run();
1848: rat/cat|dog run();
1849:
1850: .fi
1851: or as
1852: .nf
1853:
1854: %%
1855: mouse|rat/cat run();
1856: mouse|rat/dog run();
1857:
1858: .fi
1859: Note that here the special '|' action does
1860: .I not
1861: provide any savings, and can even make things worse (see
1862: .B BUGS
1863: in flex(1)).
1864: .LP
1865: Another area where the user can increase a scanner's performance
1866: (and one that's easier to implement) arises from the fact that
1867: the longer the tokens matched, the faster the scanner will run.
1868: This is because with long tokens the processing of most input
1869: characters takes place in the (short) inner scanning loop, and
1870: does not often have to go through the additional work of setting up
1871: the scanning environment (e.g.,
1872: .B yytext)
1873: for the action. Recall the scanner for C comments:
1874: .nf
1875:
1876: %x comment
1877: %%
1878: int line_num = 1;
1879:
1880: "/*" BEGIN(comment);
1881:
1882: <comment>[^*\\n]*
1883: <comment>"*"+[^*/\\n]*
1884: <comment>\\n ++line_num;
1885: <comment>"*"+"/" BEGIN(INITIAL);
1886:
1887: .fi
1888: This could be sped up by writing it as:
1889: .nf
1890:
1891: %x comment
1892: %%
1893: int line_num = 1;
1894:
1895: "/*" BEGIN(comment);
1896:
1897: <comment>[^*\\n]*
1898: <comment>[^*\\n]*\\n ++line_num;
1899: <comment>"*"+[^*/\\n]*
1900: <comment>"*"+[^*/\\n]*\\n ++line_num;
1901: <comment>"*"+"/" BEGIN(INITIAL);
1902:
1903: .fi
1904: Now instead of each newline requiring the processing of another
1905: action, recognizing the newlines is "distributed" over the other rules
1906: to keep the matched text as long as possible. Note that
1907: .I adding
1908: rules does
1909: .I not
1910: slow down the scanner! The speed of the scanner is independent
1911: of the number of rules or (modulo the considerations given at the
1912: beginning of this section) how complicated the rules are with
1913: regard to operators such as '*' and '|'.
1914: .LP
1915: A final example in speeding up a scanner: suppose you want to scan
1916: through a file containing identifiers and keywords, one per line
1917: and with no other extraneous characters, and recognize all the
1918: keywords. A natural first approach is:
1919: .nf
1920:
1921: %%
1922: asm |
1923: auto |
1924: break |
1925: ... etc ...
1926: volatile |
1927: while /* it's a keyword */
1928:
1929: .|\\n /* it's not a keyword */
1930:
1931: .fi
1932: To eliminate the back-tracking, introduce a catch-all rule:
1933: .nf
1934:
1935: %%
1936: asm |
1937: auto |
1938: break |
1939: ... etc ...
1940: volatile |
1941: while /* it's a keyword */
1942:
1943: [a-z]+ |
1944: .|\\n /* it's not a keyword */
1945:
1946: .fi
1947: Now, if it's guaranteed that there's exactly one word per line,
1948: then we can reduce the total number of matches by a half by
1949: merging in the recognition of newlines with that of the other
1950: tokens:
1951: .nf
1952:
1953: %%
1954: asm\\n |
1955: auto\\n |
1956: break\\n |
1957: ... etc ...
1958: volatile\\n |
1959: while\\n /* it's a keyword */
1960:
1961: [a-z]+\\n |
1962: .|\\n /* it's not a keyword */
1963:
1964: .fi
1965: One has to be careful here, as we have now reintroduced backtracking
1966: into the scanner. In particular, while
1967: .I we
1968: know that there will never be any characters in the input stream
1969: other than letters or newlines,
1970: .I flex
1971: can't figure this out, and it will plan for possibly needing backtracking
1972: when it has scanned a token like "auto" and then the next character
1973: is something other than a newline or a letter. Previously it would
1974: then just match the "auto" rule and be done, but now it has no "auto"
1975: rule, only a "auto\\n" rule. To eliminate the possibility of backtracking,
1976: we could either duplicate all rules but without final newlines, or,
1977: since we never expect to encounter such an input and therefore don't
1978: how it's classified, we can introduce one more catch-all rule, this
1979: one which doesn't include a newline:
1980: .nf
1981:
1982: %%
1983: asm\\n |
1984: auto\\n |
1985: break\\n |
1986: ... etc ...
1987: volatile\\n |
1988: while\\n /* it's a keyword */
1989:
1990: [a-z]+\\n |
1991: [a-z]+ |
1992: .|\\n /* it's not a keyword */
1993:
1994: .fi
1995: Compiled with
1996: .B -Cf,
1997: this is about as fast as one can get a
1998: .I flex
1999: scanner to go for this particular problem.
2000: .LP
2001: A final note:
2002: .I flex
2003: is slow when matching NUL's, particularly when a token contains
2004: multiple NUL's.
2005: It's best to write rules which match
2006: .I short
2007: amounts of text if it's anticipated that the text will often include NUL's.
2008: .SH INCOMPATIBILITIES WITH LEX AND POSIX
2009: .I flex
2010: is a rewrite of the Unix
2011: .I lex
2012: tool (the two implementations do not share any code, though),
2013: with some extensions and incompatibilities, both of which
2014: are of concern to those who wish to write scanners acceptable
2015: to either implementation. At present, the POSIX
2016: .I lex
2017: draft is
2018: very close to the original
2019: .I lex
2020: implementation, so some of these
2021: incompatibilities are also in conflict with the POSIX draft. But
2022: the intent is that except as noted below,
2023: .I flex
2024: as it presently stands will
2025: ultimately be POSIX conformant (i.e., that those areas of conflict with
2026: the POSIX draft will be resolved in
2027: .I flex's
2028: favor). Please bear in
2029: mind that all the comments which follow are with regard to the POSIX
2030: .I draft
2031: standard of Summer 1989, and not the final document (or subsequent
2032: drafts); they are included so
2033: .I flex
2034: users can be aware of the standardization issues and those areas where
2035: .I flex
2036: may in the near future undergo changes incompatible with
2037: its current definition.
2038: .LP
2039: .I flex
2040: is fully compatible with
2041: .I lex
2042: with the following exceptions:
2043: .IP -
2044: .I lex
2045: does not support exclusive start conditions (%x), though they
2046: are in the current POSIX draft.
2047: .IP -
2048: When definitions are expanded,
2049: .I flex
2050: encloses them in parentheses.
2051: With lex, the following:
2052: .nf
2053:
2054: NAME [A-Z][A-Z0-9]*
2055: %%
2056: foo{NAME}? printf( "Found it\\n" );
2057: %%
2058:
2059: .fi
2060: will not match the string "foo" because when the macro
2061: is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
2062: and the precedence is such that the '?' is associated with
2063: "[A-Z0-9]*". With
2064: .I flex,
2065: the rule will be expanded to
2066: "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
2067: Note that because of this, the
2068: .B ^, $, <s>, /,
2069: and
2070: .B <<EOF>>
2071: operators cannot be used in a
2072: .I flex
2073: definition.
2074: .IP
2075: The POSIX draft interpretation is the same as
2076: .I flex's.
2077: .IP -
2078: To specify a character class which matches anything but a left bracket (']'),
2079: in
2080: .I lex
2081: one can use "[^]]" but with
2082: .I flex
2083: one must use "[^\\]]". The latter works with
2084: .I lex,
2085: too.
2086: .IP -
2087: The undocumented
2088: .I lex
2089: scanner internal variable
2090: .B yylineno
2091: is not supported. (The variable is not part of the POSIX draft.)
2092: .IP -
2093: The
2094: .B input()
2095: routine is not redefinable, though it may be called to read characters
2096: following whatever has been matched by a rule. If
2097: .B input()
2098: encounters an end-of-file the normal
2099: .B yywrap()
2100: processing is done. A ``real'' end-of-file is returned by
2101: .B input()
2102: as
2103: .I EOF.
2104: .IP
2105: Input is instead controlled by redefining the
2106: .B YY_INPUT
2107: macro.
2108: .IP
2109: The
2110: .I flex
2111: restriction that
2112: .B input()
2113: cannot be redefined is in accordance with the POSIX draft, but
2114: .B YY_INPUT
2115: has not yet been accepted into the draft.
2116: .IP -
2117: .B output()
2118: is not supported.
2119: Output from the
2120: .B ECHO
2121: macro is done to the file-pointer
2122: .I yyout
2123: (default
2124: .I stdout).
2125: .IP
2126: The POSIX draft mentions that an
2127: .B output()
2128: routine exists but currently gives no details as to what it does.
2129: .IP -
2130: The
2131: .I lex
2132: .B %r
2133: (generate a Ratfor scanner) option is not supported. It is not part
2134: of the POSIX draft.
2135: .IP -
2136: If you are providing your own yywrap() routine, you must include a
2137: "#undef yywrap" in the definitions section (section 1). Note that
2138: the "#undef" will have to be enclosed in %{}'s.
2139: .IP
2140: The POSIX draft
2141: specifies that yywrap() is a function and this is unlikely to change; so
2142: .I flex users are warned
2143: that
2144: .B yywrap()
2145: is likely to be changed to a function in the near future.
2146: .IP -
2147: After a call to
2148: .B unput(),
2149: .I yytext
2150: and
2151: .I yyleng
2152: are undefined until the next token is matched. This is not the case with
2153: .I lex
2154: or the present POSIX draft.
2155: .IP -
2156: The precedence of the
2157: .B {}
2158: (numeric range) operator is different.
2159: .I lex
2160: interprets "abc{1,3}" as "match one, two, or
2161: three occurrences of 'abc'", whereas
2162: .I flex
2163: interprets it as "match 'ab'
2164: followed by one, two, or three occurrences of 'c'". The latter is
2165: in agreement with the current POSIX draft.
2166: .IP -
2167: The precedence of the
2168: .B ^
2169: operator is different.
2170: .I lex
2171: interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
2172: or 'bar' anywhere", whereas
2173: .I flex
2174: interprets it as "match either 'foo' or 'bar' if they come at the beginning
2175: of a line". The latter is in agreement with the current POSIX draft.
2176: .IP -
2177: To refer to yytext outside of the scanner source file,
2178: the correct definition with
2179: .I flex
2180: is "extern char *yytext" rather than "extern char yytext[]".
2181: This is contrary to the current POSIX draft but a point on which
2182: .I flex
2183: will not be changing, as the array representation entails a
2184: serious performance penalty. It is hoped that the POSIX draft will
2185: be emended to support the
2186: .I flex
2187: variety of declaration (as this is a fairly painless change to
2188: require of
2189: .I lex
2190: users).
2191: .IP -
2192: .I yyin
2193: is
2194: .I initialized
2195: by
2196: .I lex
2197: to be
2198: .I stdin;
2199: .I flex,
2200: on the other hand,
2201: initializes
2202: .I yyin
2203: to NULL
2204: and then
2205: .I assigns
2206: it to
2207: .I stdin
2208: the first time the scanner is called, providing
2209: .I yyin
2210: has not already been assigned to a non-NULL value. The difference is
2211: subtle, but the net effect is that with
2212: .I flex
2213: scanners,
2214: .I yyin
2215: does not have a valid value until the scanner has been called.
2216: .IP -
2217: The special table-size declarations such as
2218: .B %a
2219: supported by
2220: .I lex
2221: are not required by
2222: .I flex
2223: scanners;
2224: .I flex
2225: ignores them.
2226: .IP -
2227: The name
2228: .bd
2229: FLEX_SCANNER
2230: is #define'd so scanners may be written for use with either
2231: .I flex
2232: or
2233: .I lex.
2234: .LP
2235: The following
2236: .I flex
2237: features are not included in
2238: .I lex
2239: or the POSIX draft standard:
2240: .nf
2241:
2242: yyterminate()
2243: <<EOF>>
2244: YY_DECL
2245: #line directives
2246: %{}'s around actions
2247: yyrestart()
2248: comments beginning with '#' (deprecated)
2249: multiple actions on a line
2250:
2251: .fi
2252: This last feature refers to the fact that with
2253: .I flex
2254: you can put multiple actions on the same line, separated with
2255: semi-colons, while with
2256: .I lex,
2257: the following
2258: .nf
2259:
2260: foo handle_foo(); ++num_foos_seen;
2261:
2262: .fi
2263: is (rather surprisingly) truncated to
2264: .nf
2265:
2266: foo handle_foo();
2267:
2268: .fi
2269: .I flex
2270: does not truncate the action. Actions that are not enclosed in
2271: braces are simply terminated at the end of the line.
2272: .SH DIAGNOSTICS
2273: .I reject_used_but_not_detected undefined
2274: or
2275: .I yymore_used_but_not_detected undefined -
2276: These errors can occur at compile time. They indicate that the
2277: scanner uses
2278: .B REJECT
2279: or
2280: .B yymore()
2281: but that
2282: .I flex
2283: failed to notice the fact, meaning that
2284: .I flex
2285: scanned the first two sections looking for occurrences of these actions
2286: and failed to find any, but somehow you snuck some in (via a #include
2287: file, for example). Make an explicit reference to the action in your
2288: .I flex
2289: input file. (Note that previously
2290: .I flex
2291: supported a
2292: .B %used/%unused
2293: mechanism for dealing with this problem; this feature is still supported
2294: but now deprecated, and will go away soon unless the author hears from
2295: people who can argue compellingly that they need it.)
2296: .LP
2297: .I flex scanner jammed -
2298: a scanner compiled with
2299: .B -s
2300: has encountered an input string which wasn't matched by
2301: any of its rules.
2302: .LP
2303: .I flex input buffer overflowed -
2304: a scanner rule matched a string long enough to overflow the
2305: scanner's internal input buffer (16K bytes by default - controlled by
2306: .B YY_BUF_SIZE
2307: in "flex.skel". Note that to redefine this macro, you must first
2308: .B #undefine
2309: it).
2310: .LP
2311: .I scanner requires -8 flag -
2312: Your scanner specification includes recognizing 8-bit characters and
2313: you did not specify the -8 flag (and your site has not installed flex
2314: with -8 as the default).
2315: .LP
2316: .I too many %t classes! -
2317: You managed to put every single character into its own %t class.
2318: .I flex
2319: requires that at least one of the classes share characters.
2320: .SH DEFICIENCIES / BUGS
2321: See flex(1).
2322: .SH "SEE ALSO"
2323: .LP
2324: flex(1), lex(1), yacc(1), sed(1), awk(1).
2325: .LP
2326: M. E. Lesk and E. Schmidt,
2327: .I LEX - Lexical Analyzer Generator
2328: .SH AUTHOR
2329: Vern Paxson, with the help of many ideas and much inspiration from
2330: Van Jacobson. Original version by Jef Poskanzer. The fast table
2331: representation is a partial implementation of a design done by Van
2332: Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
2333: .LP
2334: Thanks to the many
2335: .I flex
2336: beta-testers, feedbackers, and contributors, especially Casey
2337: Leedom, [email protected],
2338: Frederic Brehm, Nick Christopher, Jason Coughlin,
2339: Scott David Daniels, Leo Eskin,
2340: Chris Faylor, Eric Goldman, Eric
2341: Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht,
2342: Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt,
2343: Jef Poskanzer, Jim Roskind,
2344: Dave Tallman, Frank Whaley, Ken Yap, and those whose names
2345: have slipped my marginal mail-archiving skills but whose contributions
2346: are appreciated all the same.
2347: .LP
2348: Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob
2349: Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
2350: headaches.
2351: .LP
2352: Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
2353: to Benson Margulies and Fred
2354: Burke for C++ support; to Ove Ewerlid for the basics of support for
2355: NUL's; and to Eric Hughes for the basics of support for multiple buffers.
2356: .LP
2357: Work is being done on extending
2358: .I flex
2359: to generate scanners in which the
2360: state machine is directly represented in C code rather than tables.
2361: These scanners may well be substantially faster than those generated
2362: using -f or -F. If you are working in this area and are interested
2363: in comparing notes and seeing whether redundant work can be avoided,
2364: contact Ove Ewerlid ([email protected]).
2365: .LP
2366: This work was primarily done when I was at the Real Time Systems Group
2367: at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there
2368: for the support I received.
2369: .LP
2370: Send comments to:
2371: .nf
2372:
2373: Vern Paxson
2374: Computer Science Department
2375: 4126 Upson Hall
2376: Cornell University
2377: Ithaca, NY 14853-7501
2378:
2379: [email protected]
2380: decvax!cornell!vern
2381:
2382: .fi
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.