1: .TH FLEX 1 "20 June 1989" "Version 2.1"
2: .SH NAME
3: flex - fast lexical analyzer generator
4: .SH SYNOPSIS
5: .B flex
6: [
7: .B -bdfipstvFILT -c[efmF] -Sskeleton_file
8: ] [
9: .I filename
10: ]
11: .SH DESCRIPTION
12: .I flex
13: is a rewrite of
14: .I lex
15: intended to right some of that tool's deficiencies: in particular,
16: .I flex
17: generates lexical analyzers much faster, and the analyzers use
18: smaller tables and run faster.
19: .SH OPTIONS
20: In addition to lex's
21: .B -t
22: flag, flex has the following options:
23: .TP
24: .B -b
25: Generate backtracking information to
26: .I lex.backtrack.
27: This is a list of scanner states which require backtracking
28: and the input characters on which they do so. By adding rules one
29: can remove backtracking states. If all backtracking states
30: are eliminated and
31: .B -f
32: or
33: .B -F
34: is used, the generated scanner will run faster (see the
35: .B -p
36: flag). Only users who wish to squeeze every last cycle out of their
37: scanners need worry about this option.
38: .TP
39: .B -d
40: makes the generated scanner run in
41: .I debug
42: mode. Whenever a pattern is recognized the scanner will
43: write to
44: .I stderr
45: a line of the form:
46: .nf
47:
48: --accepting rule #n
49:
50: .fi
51: Rules are numbered sequentially with the first one being 1. Rule #0
52: is executed when the scanner backtracks; Rule #(n+1) (where
53: .I n
54: is the number of rules) indicates the default action; Rule #(n+2) indicates
55: that the input buffer is empty and needs to be refilled and then the scan
56: restarted. Rules beyond (n+2) are end-of-file actions.
57: .TP
58: .B -f
59: has the same effect as lex's -f flag (do not compress the scanner
60: tables); the mnemonic changes from
61: .I fast compilation
62: to (take your pick)
63: .I full table
64: or
65: .I fast scanner.
66: The actual compilation takes
67: .I longer,
68: since flex is I/O bound writing out the big table.
69: .IP
70: This option is equivalent to
71: .B -cf
72: (see below).
73: .TP
74: .B -i
75: instructs flex to generate a
76: .I case-insensitive
77: scanner. The case of letters given in the flex input patterns will
78: be ignored, and the rules will be matched regardless of case. The
79: matched text given in
80: .I yytext
81: will preserve its original case (i.e., it will not be folded).
82: .TP
83: .B -p
84: generates a performance report to stderr. The report
85: consists of comments regarding features of the flex input file
86: which will cause a loss of performance in the resulting scanner.
87: Note that the use of
88: .I REJECT
89: and variable trailing context (see
90: .B BUGS)
91: entails a substantial performance penalty; use of
92: .I yymore(),
93: the
94: .B ^
95: operator,
96: and the
97: .B -I
98: flag entail minor performance penalties.
99: .TP
100: .B -s
101: causes the
102: .I default rule
103: (that unmatched scanner input is echoed to
104: .I stdout)
105: to be suppressed. If the scanner encounters input that does not
106: match any of its rules, it aborts with an error. This option is
107: useful for finding holes in a scanner's rule set.
108: .TP
109: .B -v
110: has the same meaning as for lex (print to
111: .I stderr
112: a summary of statistics of the generated scanner). Many more statistics
113: are printed, though, and the summary spans several lines. Most
114: of the statistics are meaningless to the casual flex user, but the
115: first line identifies the version of flex, which is useful for figuring
116: out where you stand with respect to patches and new releases.
117: .TP
118: .B -F
119: specifies that the
120: .ul
121: fast
122: scanner table representation should be used. This representation is
123: about as fast as the full table representation
124: .ul
125: (-f),
126: and for some sets of patterns will be considerably smaller (and for
127: others, larger). In general, if the pattern set contains both "keywords"
128: and a catch-all "identifier" rule, such as in the set:
129: .nf
130:
131: "case" return ( TOK_CASE );
132: "switch" return ( TOK_SWITCH );
133: ...
134: "default" return ( TOK_DEFAULT );
135: [a-z]+ return ( TOK_ID );
136:
137: .fi
138: then you're better off using the full table representation. If only
139: the "identifier" rule is present and you then use a hash table or some such
140: to detect the keywords, you're better off using
141: .ul
142: -F.
143: .IP
144: This option is equivalent to
145: .B -cF
146: (see below).
147: .TP
148: .B -I
149: instructs flex to generate an
150: .I interactive
151: scanner. Normally, scanners generated by flex always look ahead one
152: character before deciding that a rule has been matched. At the cost of
153: some scanning overhead, flex will generate a scanner which only looks ahead
154: when needed. Such scanners are called
155: .I interactive
156: because if you want to write a scanner for an interactive system such as a
157: command shell, you will probably want the user's input to be terminated
158: with a newline, and without
159: .B -I
160: the user will have to type a character in addition to the newline in order
161: to have the newline recognized. This leads to dreadful interactive
162: performance.
163: .IP
164: If all this seems too confusing, here's the general rule: if a human will
165: be typing in input to your scanner, use
166: .B -I,
167: otherwise don't; if you don't care about how fast your scanners run and
168: don't want to make any assumptions about the input to your scanner,
169: always use
170: .B -I.
171: .IP
172: Note,
173: .B -I
174: cannot be used in conjunction with
175: .I full
176: or
177: .I fast tables,
178: i.e., the
179: .B -f, -F, -cf,
180: or
181: .B -cF
182: flags.
183: .TP
184: .B -L
185: instructs flex to not generate
186: .B #line
187: directives (see below).
188: .TP
189: .B -T
190: makes flex run in
191: .I trace
192: mode. It will generate a lot of messages to stdout concerning
193: the form of the input and the resultant non-deterministic and deterministic
194: finite automata. This option is mostly for use in maintaining flex.
195: .TP
196: .B -c[efmF]
197: controls the degree of table compression.
198: .B -ce
199: directs flex to construct
200: .I equivalence classes,
201: i.e., sets of characters
202: which have identical lexical properties (for example, if the only
203: appearance of digits in the flex input is in the character class
204: "[0-9]" then the digits '0', '1', ..., '9' will all be put
205: in the same equivalence class).
206: .B -cf
207: specifies that the
208: .I full
209: scanner tables should be generated - flex should not compress the
210: tables by taking advantage of similar transition functions for
211: different states.
212: .B -cF
213: specifies that the alternate fast scanner representation (described
214: above under the
215: .B -F
216: flag)
217: should be used.
218: .B -cm
219: directs flex to construct
220: .I meta-equivalence classes,
221: which are sets of equivalence classes (or characters, if equivalence
222: classes are not being used) that are commonly used together.
223: A lone
224: .B -c
225: specifies that the scanner tables should be compressed but neither
226: equivalence classes nor meta-equivalence classes should be used.
227: .IP
228: The options
229: .B -cf
230: or
231: .B -cF
232: and
233: .B -cm
234: do not make sense together - there is no opportunity for meta-equivalence
235: classes if the table is not being compressed. Otherwise the options
236: may be freely mixed.
237: .IP
238: The default setting is
239: .B -cem
240: which specifies that flex should generate equivalence classes
241: and meta-equivalence classes. This setting provides the highest
242: degree of table compression. You can trade
243: larger tables for faster-executing scanners, with
244: the following generally being true:
245: .nf
246:
247:     slowest            smallest
248:                -cem
249:                -ce
250:                -cm
251:                -c
252:                -c{f,F}e
253:                -c{f,F}
254:     fastest            largest
255:
256: .fi
257: Note that scanners with the smallest tables compile the quickest, so
258: during development you will usually want to use the default, maximal
259: compression.
260: .TP
261: .B -Sskeleton_file
262: overrides the default skeleton file from which flex constructs
263: its scanners. You'll never need this option unless you are doing
264: flex maintenance or development.
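.LP
To make the idea behind
.B -ce
concrete, the following C sketch groups characters into equivalence
classes by comparing, for each character, which character classes of the
input it belongs to. It is only an illustration under stated assumptions,
not flex's actual algorithm: the function name build_equiv_classes and the
representation of a character class as a 256-entry membership array are
hypothetical.
.nf

```c
#define NCHARS 256

/* Hypothetical helper, not part of flex: assign an equivalence-class
 * number to every character.  sets[i][c] is non-zero when character c
 * belongs to the i-th character class used in the scanner input.
 * Two characters share a class when they belong to exactly the same
 * sets.  Returns the number of classes; class_of[c] receives the
 * class number (starting at 0) of character c. */
int build_equiv_classes(unsigned char sets[][NCHARS], int nsets,
                        int class_of[NCHARS])
{
    int rep[NCHARS];   /* representative character of each class */
    int nclasses = 0;
    int c, k, i;

    for (c = 0; c < NCHARS; c++) {
        /* look for an existing class with the same memberships */
        for (k = 0; k < nclasses; k++) {
            int same = 1;
            for (i = 0; i < nsets; i++) {
                /* compare membership, not raw values */
                if (!sets[i][c] != !sets[i][rep[k]]) {
                    same = 0;
                    break;
                }
            }
            if (same)
                break;
        }
        if (k == nclasses)
            rep[nclasses++] = c;   /* c starts a new class */
        class_of[c] = k;
    }
    return nclasses;
}
```

.fi
With only the character classes "[0-9]" and "[a-z]" as input, this sketch
yields three classes: the digits, the lower-case letters, and everything
else.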
265: .SH INCOMPATIBILITIES WITH LEX
266: .I flex
267: is fully compatible with
268: .I lex
269: with the following exceptions:
270: .IP -
271: There is no run-time library to link with. You needn't
272: specify
273: .I -ll
274: when linking, and you must supply a main program. (Hacker's note: since
275: the lex library contains a main() which simply calls yylex(), you actually
276: .I can
277: be lazy and not supply your own main program and link with
278: .I -ll.)
279: .IP -
280: lex's
281: .B %r
282: (Ratfor scanners) and
283: .B %t
284: (translation table) options
285: are not supported.
286: .IP -
287: The do-nothing
288: .ul
289: -n
290: flag is not supported.
291: .IP -
292: When definitions are expanded, flex encloses them in parentheses.
293: With lex, the following
294: .nf
295:
296: NAME [A-Z][A-Z0-9]*
297: %%
298: foo{NAME}? printf( "Found it\\n" );
299: %%
300:
301: .fi
302: will not match the string "foo" because when the macro
303: is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
304: and the precedence is such that the '?' is associated with
305: "[A-Z0-9]*". With flex, the rule will be expanded to
306: "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
307: Note that because of this, the
308: .B ^, $, <s>,
309: and
310: .B /
311: operators cannot be used in a definition.
312: .IP -
313: The undocumented lex-scanner internal variable
314: .B yylineno
315: is not supported.
316: .IP -
317: The
318: .B input()
319: routine is not redefinable, though it may be called to read characters
320: following whatever has been matched by a rule. If
321: .B input()
322: encounters an end-of-file the normal
323: .B yywrap()
324: processing is done. A ``real'' end-of-file is returned as
325: .I EOF.
326: .IP
327: Input can be controlled by redefining the
328: .B YY_INPUT
329: macro.
330: YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
331: action is to place up to max_size characters in the character buffer "buf"
332: and return in the integer variable "result" either the
333: number of characters read or the constant YY_NULL (0 on Unix systems)
334: to indicate EOF. The default YY_INPUT reads from the
335: file-pointer "yyin" (which is by default
336: .I stdin),
337: so if you
338: just want to change the input file, you needn't redefine
339: YY_INPUT - just point yyin at the input file.
340: .IP
341: A sample redefinition of YY_INPUT (in the first section of the input
342: file):
343: .nf
344:
345: %{
346: #undef YY_INPUT
347: #define YY_INPUT(buf,result,max_size) \\
348: result = (buf[0] = getchar()) == EOF ? YY_NULL : 1;
349: %}
350:
351: .fi
352: You can also add things like keeping track of the
353: input line number this way; but don't expect your scanner to
354: go very fast.
355: .IP -
356: .B output()
357: is not supported.
358: Output from the ECHO macro is done to the file-pointer
359: "yyout" (default
360: .I stdout).
361: .IP -
362: If you are providing your own yywrap() routine, you must "#undef yywrap"
363: first.
364: .IP -
365: To refer to yytext outside of your scanner source file, use
366: "extern char *yytext;" rather than "extern char yytext[];".
367: .IP -
368: .B yyleng
369: is a macro and not a variable, and hence cannot be accessed outside
370: of the scanner source file.
371: .IP -
372: flex reads only one input file, while lex's input is made
373: up of the concatenation of its input files.
374: .IP -
375: The name
376: .bd
377: FLEX_SCANNER
378: is #define'd so scanners may be written for use with either
379: flex or lex.
380: .IP -
381: The macro
382: .bd
383: YY_USER_ACTION
384: can be redefined to provide an action
385: which is always executed prior to the matched rule's action. For example,
386: it could be #define'd to call a routine to convert yytext to lower-case,
387: or to copy yyleng to a global variable to make it accessible outside of
388: the scanner source file.
389: .IP -
390: In the generated scanner, rules are separated using
391: .bd
392: YY_BREAK
393: instead of simple "break"'s. This allows, for example, C++ users to
394: #define YY_BREAK to do nothing (while being very careful that every
395: rule ends with a "break" or a "return"!) to avoid suffering from
396: unreachable statement warnings where a rule's action ends with "return".
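.LP
The remark above about tracking the input line number while redefining
YY_INPUT can be sketched as follows. This is a minimal illustration,
assuming a hypothetical helper read_counting_lines() and a hypothetical
counter variable input_line; neither is provided by flex. The helper
follows the YY_INPUT contract described above: it stores at most one
character in buf and yields either the number of characters read or 0,
which stands in for YY_NULL on Unix systems.
.nf

```c
#include <stdio.h>

/* Hypothetical counter, not part of flex: newlines read so far. */
static int input_line = 0;

/* Hypothetical helper, not part of flex: read one character from fp
 * into buf[0], counting newlines as they go by.  Returns the number of
 * characters stored (1), or 0 at end-of-file, mirroring the value that
 * YY_INPUT is expected to leave in "result". */
static int read_counting_lines(FILE *fp, char *buf)
{
    int c = getc(fp);      /* keep the int so EOF stays distinguishable */

    if (c == EOF)
        return 0;
    if (c == '\n')
        input_line++;
    buf[0] = (char) c;
    return 1;
}

/* Possible use in the first section of a flex input file:
 *
 *   #undef YY_INPUT
 *   #define YY_INPUT(buf,result,max_size) result = read_counting_lines( yyin, buf );
 */
```

.fi
As the text above warns, reading one character at a time this way keeps
the bookkeeping simple but will not give a fast scanner.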
397: .SH ENHANCEMENTS
398: .IP -
399: .I Exclusive start-conditions
400: can be declared by using
401: .B %x
402: instead of
403: .B %s.
404: These start-conditions have the property that when they are active,
405: .I no other rules are active.
406: Thus a set of rules governed by the same exclusive start condition
407: describes a scanner which is independent of any of the other rules in
408: the flex input. This feature makes it easy to specify "mini-scanners"
409: which scan portions of the input that are syntactically different
410: from the rest (e.g., comments).
411: .IP -
412: .I yyterminate()
413: can be used in lieu of a return statement in an action. It terminates
414: the scanner and returns a 0 to the scanner's caller, indicating "all done".
415: .IP -
416: .I End-of-file rules.
417: The special rule "<<EOF>>" indicates
418: actions which are to be taken when an end-of-file is
419: encountered and yywrap() returns non-zero (i.e., indicates
420: no further files to process). The action can either
421: point yyin at a new file to process, in which case the
422: action should finish with
423: .I YY_NEW_FILE
424: (this is a branch, so subsequent code in the action won't
425: be executed), or it should finish with a
426: .I return
427: statement. <<EOF>> rules may not be used with other
428: patterns; they may only be qualified with a list of start
429: conditions. If an unqualified <<EOF>> rule is given, it
430: applies only to the INITIAL start condition, and
431: .I not
432: to
433: .B %s
434: start conditions.
435: These rules are useful for catching things like unclosed comments.
436: An example:
437: .nf
438:
439: %x quote
440: %%
441: ...
442: <quote><<EOF>>  {
443:                 error( "unterminated quote" );
444:                 yyterminate();
445:                 }
446: <<EOF>>         {
447:                 yyin = fopen( next_file, "r" );
448:                 YY_NEW_FILE;
449:                 }
450:
451: .fi
452: .IP -
453: flex dynamically resizes its internal tables, so directives like "%a 3000"
454: are not needed when specifying large scanners.
455: .IP -
456: The scanning routine generated by flex is declared using the macro
457: .B YY_DECL.
458: By redefining this macro you can change the routine's name and
459: its calling sequence. For example, you could use:
460: .nf
461:
462: #undef YY_DECL
463: #define YY_DECL float lexscan( a, b ) float a, b;
464:
465: .fi
466: to give it the name
467: .I lexscan,
468: returning a float, and taking two floats as arguments. Note that
469: if you give arguments to the scanning routine, you must terminate
470: the definition with a semi-colon (;).
471: .IP -
472: flex generates
473: .B #line
474: directives mapping lines in the output to
475: their origin in the input file.
476: .IP -
477: You can put multiple actions on the same line, separated with
478: semi-colons. With lex, the following
479: .nf
480:
481: foo handle_foo(); return 1;
482:
483: .fi
484: is truncated to
485: .nf
486:
487: foo handle_foo();
488:
489: .fi
490: flex does not truncate the action. Actions that are not enclosed in
491: braces are terminated at the end of the line.
492: .IP -
493: Actions can be begun with
494: .B %{
495: and terminated with
496: .B %}.
497: In this case, flex does not count braces to figure out where the
498: action ends - actions are terminated by the closing
499: .B %}.
500: This feature is useful when the enclosed action has extraneous
501: braces in it (usually in comments or inside inactive #ifdef's)
502: that throw off the brace-count.
503: .IP -
504: All of the scanner actions (e.g.,
505: .B ECHO, yywrap ...)
506: except the
507: .B unput()
508: and
509: .B input()
510: routines,
511: are written as macros, so they can be redefined if necessary
512: without requiring a separate library to link to.
513: .IP -
514: When
515: .B yywrap()
516: indicates that the scanner is done processing (it does this by returning
517: non-zero), on subsequent calls the scanner will always immediately return
518: a value of 0. To restart it on a new input file, the action
519: .B yyrestart()
520: is used. It takes one argument, the new input file. It closes the
521: previous yyin (unless stdin) and sets up the scanner's internal variables
522: so that the next call to yylex() will start scanning the new file. This
523: functionality is useful for, e.g., programs which will process a file, do some
524: work, and then get a message to parse another file.
525: .IP -
526: Flex scans the code in section 1 (inside %{}'s) and the actions for
527: occurrences of
528: .I REJECT
529: and
530: .I yymore().
531: If it doesn't see any, it assumes the features are not used and generates
532: higher-performance scanners. Flex tries to be correct in identifying
533: uses but can be fooled (for example, if a reference is made in a macro from
534: a #include file). If this happens (a feature is used and flex didn't
535: realize it) you will get a compile-time error of the form
536: .nf
537:
538: reject_used_but_not_detected undefined
539:
540: .fi
541: You can tell flex that a feature is used even if it doesn't think so
542: with
543: .B %used
544: followed by the name of the feature (for example, "%used REJECT");
545: similarly, you can specify that a feature is
546: .I not
547: used even though it thinks it is with
548: .B %unused.
549: .IP -
550: Comments may be put in the first section of the input by preceding
551: them with '#'.
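.LP
To see why the %{ ... %} form of actions described above can be useful,
consider a naive brace counter. The sketch below is an assumption about
the general technique, not flex's actual code; the function name
naive_action_end is hypothetical. It counts braces without regard to
comments, so a stray brace inside a comment makes it report the end of
the action too early.
.nf

```c
#include <stddef.h>

/* Hypothetical illustration, not flex source: return the index just
 * past the '}' that balances the first '{' in text, counting every
 * brace (even those inside comments), or -1 if the braces never
 * balance. */
static long naive_action_end(const char *text)
{
    long depth = 0;
    size_t i;

    for (i = 0; text[i] != 0; i++) {
        if (text[i] == '{') {
            depth++;
        } else if (text[i] == '}') {
            depth--;
            if (depth == 0)
                return (long) (i + 1);   /* braces balanced here */
            if (depth < 0)
                return -1;               /* closing brace with no opener */
        }
    }
    return -1;                           /* ran out of text first */
}
```

.fi
For the action text "{ /* } */ foo(); }" the counter stops at the brace
inside the comment, which is the situation that delimiting the action
with %{ and %} avoids.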
552: .SH FILES
553: .TP
554: .I flex.skel
555: skeleton scanner
556: .TP
557: .I lex.yy.c
558: generated scanner (called
559: .I lexyy.c
560: on some systems).
561: .TP
562: .I lex.backtrack
563: backtracking information for
564: .B -b
565: flag (called
566: .I lex.bck
567: on some systems).
568: .SH "SEE ALSO"
569: .LP
570: lex(1)
571: .LP
572: M. E. Lesk and E. Schmidt,
573: .I LEX - Lexical Analyzer Generator
574: .SH AUTHOR
575: Vern Paxson, with the help of many ideas and much inspiration from
576: Van Jacobson. Original version by Jef Poskanzer. Fast table
577: representation is a partial implementation of a design done by Van
578: Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
579: .LP
580: Thanks to the many flex beta-testers and feedbackers, especially Casey
581: Leedom, Frederic Brehm, Nick Christopher, Chris Faylor, Eric Goldman, Eric
582: Hughes, Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Esmond Pitt,
583: Jef Poskanzer, and Dave Tallman. Thanks to Keith Bostic, John Gilmore, Bob
584: Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
585: headaches.
586: .LP
587: Send comments to:
588: .nf
589:
590: Vern Paxson
591: Real Time Systems
592: Bldg. 46A
593: Lawrence Berkeley Laboratory
594: 1 Cyclotron Rd.
595: Berkeley, CA 94720
596:
597: (415) 486-6411
598:
599: [email protected]
600: [email protected]
601: ucbvax!csam.lbl.gov!vern
602:
603: .fi
604: I will be gone from mid-July '89 through mid-August '89. From August on,
605: the addresses are:
606: .nf
607:
608: [email protected]
609:
610: Vern Paxson
611: CS Department
612: Grad Office
613: 4126 Upson
614: Cornell University
615: Ithaca, NY 14853-7501
616:
617: <no phone number yet>
618:
619: .fi
620: Email sent to the former addresses should continue to be forwarded for
621: quite a while. Also, it looks like my username will be "paxson" and
622: not "vern". I'm planning on having a mail alias set up so "vern" will
623: still work, but if you encounter problems try "paxson".
624: .SH DIAGNOSTICS
625: .LP
626: .I flex scanner jammed -
627: a scanner compiled with
628: .B -s
629: has encountered an input string which wasn't matched by
630: any of its rules.
631: .LP
632: .I flex input buffer overflowed -
633: a scanner rule matched a string long enough to overflow the
634: scanner's internal input buffer (16K bytes - controlled by
635: .B YY_BUF_MAX
636: in "flex.skel").
637: .LP
638: .I old-style lex command ignored -
639: the flex input contains a lex command (e.g., "%n 1000") which
640: is being ignored.
641: .SH BUGS
642: .LP
643: Some trailing context
644: patterns cannot be properly matched and generate
645: warning messages ("Dangerous trailing context"). These are
646: patterns where the ending of the
647: first part of the rule matches the beginning of the second
648: part, such as "zx*/xy*", where the 'x*' matches the 'x' at
649: the beginning of the trailing context. (Lex doesn't get these
650: patterns right either.)
651: If desperate, you can use
652: .B yyless()
653: to effect arbitrary trailing context.
654: .LP
655: .I Variable
656: trailing context (where neither the leading nor the trailing part has
657: a fixed length) entails the same performance loss as
658: .I REJECT
659: (i.e., substantial).
660: .LP
661: For some trailing context rules, parts which are actually fixed-length are
662: not recognized as such, leading to the above-mentioned performance loss.
663: In particular, parts using '|' or {n} are always considered variable-length.
664: .LP
665: Use of unput() or input() trashes the current yytext and yyleng.
666: .LP
667: Use of unput() to push back more text than was matched can
668: result in the pushed-back text matching a beginning-of-line ('^')
669: rule even though it didn't come at the beginning of the line.
670: .LP
671: yytext and yyleng cannot be modified within a flex action.
672: .LP
673: Nulls are not allowed in flex inputs or in the inputs to
674: scanners generated by flex. Their presence generates fatal
675: errors.
676: .LP
677: Flex does not generate correct #line directives for code internal
678: to the scanner; thus, bugs in
679: .I
680: flex.skel
681: yield bogus line numbers.
682: .LP
683: Pushing back definitions enclosed in ()'s can result in nasty,
684: difficult-to-understand problems like:
685: .nf
686:
687: {DIG} [0-9] /* a digit */
688:
689: .fi
690: in which the pushed-back text is "([0-9] /* a digit */)".
691: .LP
692: Due to both buffering of input and read-ahead, you cannot intermix
693: calls to stdio routines, such as, for example,
694: .B getchar()
695: with flex rules and expect it to work. Call
696: .B input()
697: instead.
698: .LP
699: The total number of table entries listed by the
700: .B -v
701: flag excludes the number of table entries needed to determine
702: what rule has been matched. The number of entries is equal
703: to the number of DFA states if the scanner does not use REJECT,
704: and somewhat greater than the number of states if it does.
705: .LP
706: To be consistent with ANSI C, the escape sequence \\xhh should
707: be recognized for hexadecimal escape sequences, such as '\\x41' for 'A'.
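.LP
As an illustration of the \\xhh form mentioned above, the C sketch below
decodes such an escape. It is only a sketch under stated assumptions: the
function name parse_hex_escape is hypothetical, the limit of two
hexadecimal digits follows the "hh" notation used here, and this is not
code from flex.
.nf

```c
#include <ctype.h>

/* Hypothetical helper, not part of flex: decode an escape of the form
 * backslash, 'x', then one or two hexadecimal digits.  Returns the
 * character value, or -1 if s does not start with such an escape. */
static int parse_hex_escape(const char *s)
{
    int value = 0;
    int ndigits = 0;

    if (s[0] != '\\' || s[1] != 'x')
        return -1;
    s += 2;
    while (ndigits < 2 && isxdigit((unsigned char) *s)) {
        int d;

        if (isdigit((unsigned char) *s))
            d = *s - '0';
        else
            d = tolower((unsigned char) *s) - 'a' + 10;
        value = value * 16 + d;
        ndigits++;
        s++;
    }
    return ndigits > 0 ? value : -1;
}
```

.fi
On an ASCII system, passing the four characters backslash, 'x', '4', '1'
to this helper gives the value of 'A'.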
708: .LP
709: It would be useful if flex wrote to lex.yy.c a summary of the flags used in
710: its generation (such as which table compression options).
711: .LP
712: The scanner run-time speeds still have not been optimized as much
713: as they deserve. Van Jacobson's work shows that they can go
714: faster still.
715: .LP
716: The utility needs more complete documentation.