43BSDReno/pgrm/lex/flexdoc.1 - annotate

Annotation of 43BSDReno/pgrm/lex/flexdoc.1, revision 1.1.1.1

1.1       root        1: .TH FLEX 1 "26 May 1990" "Version 2.3"
                      2: .SH NAME
                      3: flex - fast lexical analyzer generator
                      4: .SH SYNOPSIS
                      5: .B flex
                      6: .B [-bcdfinpstvFILT8 -C[efmF] -Sskeleton]
                      7: .I [filename ...]
                      8: .SH DESCRIPTION
                      9: .I flex
                     10: is a tool for generating
                     11: .I scanners:
                     12: programs which recognize lexical patterns in text.
                     13: .I flex
                     14: reads
                     15: the given input files, or its standard input if no file names are given,
                     16: for a description of a scanner to generate.  The description is in
                     17: the form of pairs
                     18: of regular expressions and C code, called
                     19: .I rules.  flex
                     20: generates as output a C source file,
                     21: .B lex.yy.c,
                     22: which defines a routine
                     23: .B yylex().
                     24: This file is compiled and linked with the
                     25: .B -lfl
                     26: library to produce an executable.  When the executable is run,
                     27: it analyzes its input for occurrences
                     28: of the regular expressions.  Whenever it finds one, it executes
                     29: the corresponding C code.
                     30: .SH SOME SIMPLE EXAMPLES
                     31: .LP
                     32: First some simple examples to get the flavor of how one uses
                     33: .I flex.
                     34: The following
                     35: .I flex
                     36: input specifies a scanner which, whenever it encounters the string
                     37: "username", replaces it with the user's login name:
                     38: .nf
                     39: 
                     40:     %%
                     41:     username    printf( "%s", getlogin() );
                     42: 
                     43: .fi
                     44: By default, any text not matched by a
                     45: .I flex
                     46: scanner
                     47: is copied to the output, so the net effect of this scanner is
                     48: to copy its input file to its output with each occurrence
                     49: of "username" expanded.
                     50: In this input, there is just one rule.  "username" is the
                     51: .I pattern
                     52: and the "printf" is the
                     53: .I action.
                     54: The "%%" marks the beginning of the rules.
                     55: .LP
                     56: Here's another simple example:
                     57: .nf
                     58: 
                     59:         int num_lines = 0, num_chars = 0;
                     60: 
                     61:     %%
                     62:     \\n    ++num_lines; ++num_chars;
                     63:     .     ++num_chars;
                     64: 
                     65:     %%
                     66:     main()
                     67:         {
                     68:         yylex();
                     69:         printf( "# of lines = %d, # of chars = %d\\n",
                     70:                 num_lines, num_chars );
                     71:         }
                     72: 
                     73: .fi
                     74: This scanner counts the number of characters and the number
                     75: of lines in its input (it produces no output other than the
                     76: final report on the counts).  The first line
                     77: declares two globals, "num_lines" and "num_chars", which are accessible
                     78: both inside
                     79: .B yylex()
                     80: and in the
                     81: .B main()
                     82: routine declared after the second "%%".  There are two rules, one
                     83: which matches a newline ("\\n") and increments both the line count and
                     84: the character count, and one which matches any character other than
                     85: a newline (indicated by the "." regular expression).
                     86: .LP
                     87: A somewhat more complicated example:
                     88: .nf
                     89: 
                     90:     /* scanner for a toy Pascal-like language */
                     91: 
                     92:     %{
                     93:     /* need this for the call to atof() below */
                     94:     #include <math.h>
                     95:     %}
                     96: 
                     97:     DIGIT    [0-9]
                     98:     ID       [a-z][a-z0-9]*
                     99: 
                    100:     %%
                    101: 
                    102:     {DIGIT}+    {
                    103:                 printf( "An integer: %s (%d)\\n", yytext,
                    104:                         atoi( yytext ) );
                    105:                 }
                    106: 
                    107:     {DIGIT}+"."{DIGIT}*        {
                    108:                 printf( "A float: %s (%g)\\n", yytext,
                    109:                         atof( yytext ) );
                    110:                 }
                    111: 
                    112:     if|then|begin|end|procedure|function        {
                    113:                 printf( "A keyword: %s\\n", yytext );
                    114:                 }
                    115: 
                    116:     {ID}        printf( "An identifier: %s\\n", yytext );
                    117: 
                    118:     "+"|"-"|"*"|"/"   printf( "An operator: %s\\n", yytext );
                    119: 
                    120:     "{"[^}\\n]*"}"     /* eat up one-line comments */
                    121: 
                    122:     [ \\t\\n]+          /* eat up whitespace */
                    123: 
                    124:     .           printf( "Unrecognized character: %s\\n", yytext );
                    125: 
                    126:     %%
                    127: 
                    128:     main( argc, argv )
                    129:     int argc;
                    130:     char **argv;
                    131:         {
                    132:         ++argv, --argc;  /* skip over program name */
                    133:         if ( argc > 0 )
                    134:                 yyin = fopen( argv[0], "r" );
                    135:         else
                    136:                 yyin = stdin;
                    137:         
                    138:         yylex();
                    139:         }
                    140: 
                    141: .fi
                    142: This is the beginning of a simple scanner for a language like
                    143: Pascal.  It identifies different types of
                    144: .I tokens
                    145: and reports on what it has seen.
                    146: .LP
                    147: The details of this example will be explained in the following
                    148: sections.
                    149: .SH FORMAT OF THE INPUT FILE
                    150: The
                    151: .I flex
                    152: input file consists of three sections, separated by a line with just
                    153: .B %%
                    154: in it:
                    155: .nf
                    156: 
                    157:     definitions
                    158:     %%
                    159:     rules
                    160:     %%
                    161:     user code
                    162: 
                    163: .fi
                    164: The
                    165: .I definitions
                    166: section contains declarations of simple
                    167: .I name
                    168: definitions to simplify the scanner specification, and declarations of
                    169: .I start conditions,
                    170: which are explained in a later section.
                    171: .LP
                    172: Name definitions have the form:
                    173: .nf
                    174: 
                    175:     name definition
                    176: 
                    177: .fi
                    178: The "name" is a word beginning with a letter or an underscore ('_')
                    179: followed by zero or more letters, digits, '_', or '-' (dash).
                    180: The definition is taken to begin at the first non-white-space character
                    181: following the name and continuing to the end of the line.
                    182: The definition can subsequently be referred to using "{name}", which
                    183: will expand to "(definition)".  For example,
                    184: .nf
                    185: 
                    186:     DIGIT    [0-9]
                    187:     ID       [a-z][a-z0-9]*
                    188: 
                    189: .fi
                    190: defines "DIGIT" to be a regular expression which matches a
                    191: single digit, and
                    192: "ID" to be a regular expression which matches a letter
                    193: followed by zero-or-more letters-or-digits.
                    194: A subsequent reference to
                    195: .nf
                    196: 
                    197:     {DIGIT}+"."{DIGIT}*
                    198: 
                    199: .fi
                    200: is identical to
                    201: .nf
                    202: 
                    203:     ([0-9])+"."([0-9])*
                    204: 
                    205: .fi
                    206: and matches one-or-more digits followed by a '.' followed
                    207: by zero-or-more digits.
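The expansion of "{name}" to "(definition)" described above can be sketched in C.
This is only an illustration, not how flex itself is implemented: the
struct name_def table and the expand_names() routine are hypothetical names
introduced here, and the sketch ignores quoting, so a '{' inside a quoted
string is treated like any other.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry of a hypothetical name-definition table. */
struct name_def {
    const char *name;
    const char *definition;
};

/*
 * Copy `pattern` into `out`, replacing each "{name}" found in `defs`
 * with "(definition)".  Unknown names, and braces such as "{2,5}",
 * are copied unchanged.  Returns 0 on success, -1 if `out` is too small.
 */
int expand_names(const char *pattern, const struct name_def *defs,
                 size_t ndefs, char *out, size_t outsize)
{
    size_t used = 0;

    assert(pattern != NULL && out != NULL && outsize > 0);

    while (*pattern != '\0') {
        const char *close = (*pattern == '{') ? strchr(pattern, '}') : NULL;
        const char *def = NULL;
        size_t i;

        if (close != NULL) {
            size_t len = (size_t)(close - pattern - 1);
            for (i = 0; i < ndefs; ++i) {
                if (strlen(defs[i].name) == len &&
                    strncmp(defs[i].name, pattern + 1, len) == 0) {
                    def = defs[i].definition;
                    break;
                }
            }
        }

        if (def != NULL) {
            size_t need = strlen(def) + 2;      /* "(" + definition + ")" */
            if (used + need >= outsize)
                return -1;
            out[used++] = '(';
            memcpy(out + used, def, strlen(def));
            used += strlen(def);
            out[used++] = ')';
            pattern = close + 1;
        } else {
            if (used + 1 >= outsize)
                return -1;
            out[used++] = *pattern++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

The added parentheses are what keep a following operator such as '+' or '*'
applied to the whole definition rather than to its last element.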
                    208: .LP
                    209: The
                    210: .I rules
                    211: section of the
                    212: .I flex
                    213: input contains a series of rules of the form:
                    214: .nf
                    215: 
                    216:     pattern   action
                    217: 
                    218: .fi
                    219: where the pattern must be unindented and the action must begin
                    220: on the same line.
                    221: .LP
                    222: See below for a further description of patterns and actions.
                    223: .LP
                    224: Finally, the user code section is simply copied to
                    225: .B lex.yy.c
                    226: verbatim.
                    227: It is used for companion routines which call or are called
                    228: by the scanner.  The presence of this section is optional;
                    229: if it is missing, the second
                    230: .B %%
                    231: in the input file may be skipped, too.
                    232: .LP
                    233: In the definitions and rules sections, any
                    234: .I indented
                    235: text or text enclosed in
                    236: .B %{
                    237: and
                    238: .B %}
                    239: is copied verbatim to the output (with the %{}'s removed).
                    240: The %{}'s must appear unindented on lines by themselves.
                    241: .LP
                    242: In the rules section,
                    243: any indented or %{} text appearing before the
                    244: first rule may be used to declare variables
                    245: which are local to the scanning routine and (after the declarations)
                    246: code which is to be executed whenever the scanning routine is entered.
                    247: Other indented or %{} text in the rule section is still copied to the output,
                    248: but its meaning is not well-defined and it may well cause compile-time
                    249: errors (this feature is present for
                    250: .I POSIX
                    251: compliance; see below for other such features).
                    252: .LP
                    253: In the definitions section, an unindented comment (i.e., a line
                    254: beginning with "/*") is also copied verbatim to the output up
                    255: to the next "*/".  Also, any line in the definitions section
                    256: beginning with '#' is ignored, though this style of comment is
                    257: deprecated and may go away in the future.
                    258: .SH PATTERNS
                    259: The patterns in the input are written using an extended set of regular
                    260: expressions.  These are:
                    261: .nf
                    262: 
                    263:     x          match the character 'x'
                    264:     .          any character except newline
                    265:     [xyz]      a "character class"; in this case, the pattern
                    266:                  matches either an 'x', a 'y', or a 'z'
                    267:     [abj-oZ]   a "character class" with a range in it; matches
                    268:                  an 'a', a 'b', any letter from 'j' through 'o',
                    269:                  or a 'Z'
                    270:     [^A-Z]     a "negated character class", i.e., any character
                    271:                  but those in the class.  In this case, any
                    272:                  character EXCEPT an uppercase letter.
                    273:     [^A-Z\\n]   any character EXCEPT an uppercase letter or
                    274:                  a newline
                    275:     r*         zero or more r's, where r is any regular expression
                    276:     r+         one or more r's
                    277:     r?         zero or one r's (that is, "an optional r")
                    278:     r{2,5}     anywhere from two to five r's
                    279:     r{2,}      two or more r's
                    280:     r{4}       exactly 4 r's
                    281:     {name}     the expansion of the "name" definition
                    282:                (see above)
                    283:     "[xyz]\\"foo"
                    284:                the literal string: [xyz]"foo
                    285:     \\X         if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
                    286:                  then the ANSI-C interpretation of \\X.
                    287:                  Otherwise, a literal 'X' (used to escape
                    288:                  operators such as '*')
                    289:     \\123       the character with octal value 123
                    290:     \\x2a       the character with hexadecimal value 2a
                    291:     (r)        match an r; parentheses are used to override
                    292:                  precedence (see below)
                    293: 
                    294: 
                    295:     rs         the regular expression r followed by the
                    296:                  regular expression s; called "concatenation"
                    297: 
                    298: 
                    299:     r|s        either an r or an s
                    300: 
                    301: 
                    302:     r/s        an r but only if it is followed by an s.  The
                    303:                  s is not part of the matched text.  This type
                    304:                  of pattern is called "trailing context".
                    305:     ^r         an r, but only at the beginning of a line
                    306:     r$         an r, but only at the end of a line.  Equivalent
                    307:                  to "r/\\n".
                    308: 
                    309: 
                    310:     <s>r       an r, but only in start condition s (see
                    311:                below for discussion of start conditions)
                    312:     <s1,s2,s3>r
                    313:                same, but in any of start conditions s1,
                    314:                s2, or s3
                    315: 
                    316: 
                    317:     <<EOF>>    an end-of-file
                    318:     <s1,s2><<EOF>>
                    319:                an end-of-file when in start condition s1 or s2
                    320: 
                    321: .fi
                    322: The regular expressions listed above are grouped according to
                    323: precedence, from highest precedence at the top to lowest at the bottom.
                    324: Those grouped together have equal precedence.  For example,
                    325: .nf
                    326: 
                    327:     foo|bar*
                    328: 
                    329: .fi
                    330: is the same as
                    331: .nf
                    332: 
                    333:     (foo)|(ba(r*))
                    334: 
                    335: .fi
                    336: since the '*' operator has higher precedence than concatenation,
                    337: and concatenation higher than alternation ('|').  This pattern
                    338: therefore matches
                    339: .I either
                    340: the string "foo"
                    341: .I or
                    342: the string "ba" followed by zero-or-more r's.
                    343: To match "foo" or zero-or-more "bar"'s, use:
                    344: .nf
                    345: 
                    346:     foo|(bar)*
                    347: 
                    348: .fi
                    349: and to match zero-or-more "foo"'s-or-"bar"'s:
                    350: .nf
                    351: 
                    352:     (foo|bar)*
                    353: 
                    354: .fi
                    355: .LP
                    356: Some notes on patterns:
                    357: .IP -
                    358: A negated character class such as the example "[^A-Z]"
                    359: above
                    360: .I will match a newline
                    361: unless "\\n" (or an equivalent escape sequence) is one of the
                    362: characters explicitly present in the negated character class
                    363: (e.g., "[^A-Z\\n]").  This is unlike how many other regular
                    364: expression tools treat negated character classes, but unfortunately
                    365: the inconsistency is historically entrenched.
                    366: Matching newlines means that a pattern like [^"]* can match an entire
                    367: input (overflowing the scanner's input buffer) unless there's another
                    368: quote in the input.
                    369: .IP -
                    370: A rule can have at most one instance of trailing context (the '/' operator
                    371: or the '$' operator).  The start condition, '^', and "<<EOF>>" patterns
                    372: can only occur at the beginning of a pattern and, like '/' and '$',
                    373: cannot be grouped inside parentheses.  A '^' which does not occur at
                    374: the beginning of a rule or a '$' which does not occur at the end of
                    375: a rule loses its special properties and is treated as a normal character.
                    376: .IP
                    377: The following are illegal:
                    378: .nf
                    379: 
                    380:     foo/bar$
                    381:     <sc1>foo<sc2>bar
                    382: 
                    383: .fi
                    384: Note that the first of these can be written "foo/bar\\n".
                    385: .IP
                    386: The following will result in '$' or '^' being treated as a normal character:
                    387: .nf
                    388: 
                    389:     foo|(bar$)
                    390:     foo|^bar
                    391: 
                    392: .fi
                    393: If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
                    394: could be used (the special '|' action is explained below):
                    395: .nf
                    396: 
                    397:     foo      |
                    398:     bar$     /* action goes here */
                    399: 
                    400: .fi
                    401: A similar trick will work for matching a foo or a
                    402: bar-at-the-beginning-of-a-line.
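The first note above, that a negated character class matches a newline unless
one is listed in it, can be illustrated with a short C sketch.  The routine
negated_class_matches() is a hypothetical helper written for this example; it
assumes the class body contains only single characters and "lo-hi" ranges,
with escape sequences already converted to the characters they stand for.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if character `c` is matched by the negated class [^set],
 * where `set` holds the text between "[^" and "]".  Only single
 * characters and "lo-hi" ranges are understood; a newline in the
 * class is expected to appear as a real '\n'.
 */
int negated_class_matches(const char *set, int c)
{
    size_t i = 0;

    assert(set != NULL);

    while (set[i] != '\0') {
        unsigned char lo = (unsigned char)set[i];
        unsigned char hi = lo;

        if (set[i + 1] == '-' && set[i + 2] != '\0') {
            hi = (unsigned char)set[i + 2];
            i += 3;
        } else {
            i += 1;
        }
        if (c >= lo && c <= hi)
            return 0;   /* listed in the class, so the negation fails */
    }
    return 1;           /* not listed: matches, newline included */
}
```

Under these assumptions, negated_class_matches( "A-Z", '\n' ) yields 1 while
negated_class_matches( "A-Z\n", '\n' ) yields 0, mirroring the difference
between "[^A-Z]" and "[^A-Z\\n]".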
                    403: .SH HOW THE INPUT IS MATCHED
                    404: When the generated scanner is run, it analyzes its input looking
                    405: for strings which match any of its patterns.  If it finds more than
                    406: one match, it takes the one matching the most text (for trailing
                    407: context rules, this includes the length of the trailing part, even
                    408: though it will then be returned to the input).  If it finds two
                    409: or more matches of the same length, the
                    410: rule listed first in the
                    411: .I flex
                    412: input file is chosen.
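The selection rule just described (longest match wins, ties go to the rule
listed first) can be sketched in C as follows.  This is an illustration only;
a generated scanner does not work this way internally.  The routine
choose_rule() and its match_len array are hypothetical, and the sketch treats
a length of 0 as "this rule did not match".

```c
#include <assert.h>
#include <stddef.h>

/*
 * match_len[i] is the number of characters rule i matches at the current
 * input position (0 means rule i does not match).  Rules are indexed in
 * the order they appear in the input file.  Returns the index of the rule
 * to run, or -1 if no rule matches.
 */
int choose_rule(const size_t *match_len, size_t nrules)
{
    int best = -1;
    size_t best_len = 0;
    size_t i;

    assert(match_len != NULL || nrules == 0);

    for (i = 0; i < nrules; ++i) {
        /* Strictly greater: on a tie the earlier rule stays chosen. */
        if (match_len[i] > best_len) {
            best = (int)i;
            best_len = match_len[i];
        }
    }
    return best;
}
```

In the toy Pascal scanner shown earlier, the keyword rule precedes the {ID}
rule.  On the input "if" both match two characters, so the keyword rule is
chosen; on "iffy" the {ID} rule matches four characters and wins.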
                    413: .LP
                    414: Once the match is determined, the text corresponding to the match
                    415: (called the
                    416: .I token)
                    417: is made available in the global character pointer
                    418: .B yytext,
                    419: and its length in the global integer
                    420: .B yyleng.
                    421: The
                    422: .I action
                    423: corresponding to the matched pattern is then executed (a more
                    424: detailed description of actions follows), and then the remaining
                    425: input is scanned for another match.
                    426: .LP
                    427: If no match is found, then the
                    428: .I default rule
                    429: is executed: the next character in the input is considered matched and
                    430: copied to the standard output.  Thus, the simplest legal
                    431: .I flex
                    432: input is:
                    433: .nf
                    434: 
                    435:     %%
                    436: 
                    437: .fi
                    438: which generates a scanner that simply copies its input (one character
                    439: at a time) to its output.
                    440: .SH ACTIONS
                    441: Each pattern in a rule has a corresponding action, which can be any
                    442: arbitrary C statement.  The pattern ends at the first non-escaped
                    443: whitespace character; the remainder of the line is its action.  If the
                    444: action is empty, then when the pattern is matched the input token
                    445: is simply discarded.  For example, here is the specification for a program
                    446: which deletes all occurrences of "zap me" from its input:
                    447: .nf
                    448: 
                    449:     %%
                    450:     "zap me"
                    451: 
                    452: .fi
                    453: (It will copy all other characters in the input to the output since
                    454: they will be matched by the default rule.)
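For comparison, the effect of this scanner can be approximated by hand in C.
The sketch below is an assumption-laden equivalent that works on a string in
memory rather than on yyin; the routine name zap() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `in` to `out`, dropping every occurrence of "zap me".
 * `out` must have room for strlen(in) + 1 bytes.
 */
void zap(const char *in, char *out)
{
    static const char target[] = "zap me";
    const size_t tlen = sizeof target - 1;

    assert(in != NULL && out != NULL);

    while (*in != '\0') {
        if (strncmp(in, target, tlen) == 0) {
            in += tlen;          /* rule matched: empty action, discard */
        } else {
            *out++ = *in++;      /* default rule: copy one character */
        }
    }
    *out = '\0';
}
```

The two branches correspond to the single rule with its empty action and to
the default rule, respectively.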
                    455: .LP
                    456: Here is a program which compresses multiple blanks and tabs down to
                    457: a single blank, and throws away whitespace found at the end of a line:
                    458: .nf
                    459: 
                    460:     %%
                    461:     [ \\t]+        putchar( ' ' );
                    462:     [ \\t]+$       /* ignore this token */
                    463: 
                    464: .fi
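A hand-written C approximation of this scanner shows how the two rules
interact.  It is a sketch under stated assumptions: it works on a string in
memory, the routine name squeeze_blanks() is hypothetical, and it relies on
"r$" meaning "r followed by a newline", so a run of blanks at the very end of
the input with no newline after it is still reduced to one blank.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy `in` to `out`, turning each run of blanks and tabs into a single
 * blank, except that a run immediately followed by a newline is dropped.
 * `out` must have room for strlen(in) + 1 bytes.
 */
void squeeze_blanks(const char *in, char *out)
{
    assert(in != NULL && out != NULL);

    while (*in != '\0') {
        if (*in == ' ' || *in == '\t') {
            while (*in == ' ' || *in == '\t')
                ++in;                 /* consume the whole run */
            if (*in != '\n')
                *out++ = ' ';         /* first rule: emit one blank */
            /* otherwise the second rule applies: emit nothing */
        } else {
            *out++ = *in++;           /* default rule: copy the character */
        }
    }
    *out = '\0';
}
```

The second rule wins at the end of a line because the trailing newline counts
toward the length of its match, making it the longer of the two.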
                    465: .LP
                    466: If the action contains a '{', then the action extends until the balancing '}'
                    467: is found, and the action may cross multiple lines.
                    468: .I flex 
                    469: knows about C strings and comments and won't be fooled by braces found
                    470: within them, but also allows actions to begin with
                    471: .B %{
                    472: and will consider the action to be all the text up to the next
                    473: .B %}
                    474: (regardless of ordinary braces inside the action).
                    475: .LP
                    476: An action consisting solely of a vertical bar ('|') means "same as
                    477: the action for the next rule."  See below for an illustration.
                    478: .LP
                    479: Actions can include arbitrary C code, including
                    480: .B return
                    481: statements to return a value to whatever routine called
                    482: .B yylex().
                    483: Each time
                    484: .B yylex()
                    485: is called it continues processing tokens from where it last left
                    486: off until it either reaches
                    487: the end of the file or executes a return.  Once it reaches an end-of-file,
                    488: however, then any subsequent call to
                    489: .B yylex()
                    490: will simply return immediately, unless
                    491: .B yyrestart()
                    492: is first called (see below).
                    493: .LP
                    494: Actions are not allowed to modify yytext or yyleng.
                    495: .LP
                    496: There are a number of special directives which can be included within
                    497: an action:
                    498: .IP -
                    499: .B ECHO
                    500: copies yytext to the scanner's output.
                    501: .IP -
                    502: .B BEGIN
                    503: followed by the name of a start condition places the scanner in the
                    504: corresponding start condition (see below).
                    505: .IP -
                    506: .B REJECT
                    507: directs the scanner to proceed on to the "second best" rule which matched the
                    508: input (or a prefix of the input).  The rule is chosen as described
                    509: above in "How the Input is Matched", and
                    510: .B yytext
                    511: and
                    512: .B yyleng
                    513: are set up appropriately.
                    514: It may either be one which matched as much text
                    515: as the originally chosen rule but came later in the
                    516: .I flex
                    517: input file, or one which matched less text.
                    518: For example, the following will both count the
                    519: words in the input and call the routine special() whenever "frob" is seen:
                    520: .nf
                    521: 
                    522:             int word_count = 0;
                    523:     %%
                    524: 
                    525:     frob        special(); REJECT;
                    526:     [^ \\t\\n]+   ++word_count;
                    527: 
                    528: .fi
                    529: Without the
                    530: .B REJECT,
                    531: any "frob"'s in the input would not be counted as words, since the
                    532: scanner normally executes only one action per token.
                    533: Multiple
                    534: .B REJECT's
                    535: are allowed, each one finding the next best choice to the currently
                    536: active rule.  For example, when the following scanner scans the token
                    537: "abcd", it will write "abcdabcaba" to the output:
                    538: .nf
                    539: 
                    540:     %%
                    541:     a        |
                    542:     ab       |
                    543:     abc      |
                    544:     abcd     ECHO; REJECT;
                    545:     .|\\n     /* eat up any unmatched character */
                    546: 
                    547: .fi
                    548: (The first three rules share the fourth's action since they use
                    549: the special '|' action.)
                    550: .B REJECT
                    551: is a particularly expensive feature in terms of scanner performance;
                    552: if it is used in
                    553: .I any
                    554: of the scanner's actions it will slow down
                    555: .I all
                    556: of the scanner's matching.  Furthermore,
                    557: .B REJECT
                    558: cannot be used with the
                    559: .I -f
                    560: or
                    561: .I -F
                    562: options (see below).
                    563: .IP
                    564: Note also that unlike the other special actions,
                    565: .B REJECT
                    566: is a
                    567: .I branch;
                    568: code immediately following it in the action will
                    569: .I not
                    570: be executed.
                    571: .IP -
                    572: .B yymore()
                    573: tells the scanner that the next time it matches a rule, the corresponding
                    574: token should be
                    575: .I appended
                    576: onto the current value of
                    577: .B yytext
                    578: rather than replacing it.  For example, given the input "mega-kludge"
                    579: the following will write "mega-mega-kludge" to the output:
                    580: .nf
                    581: 
                    582:     %%
                    583:     mega-    ECHO; yymore();
                    584:     kludge   ECHO;
                    585: 
                    586: .fi
                    587: First "mega-" is matched and echoed to the output.  Then "kludge"
                    588: is matched, but the previous "mega-" is still hanging around at the
                    589: beginning of
                    590: .B yytext
                    591: so the
                    592: .B ECHO
                    593: for the "kludge" rule will actually write "mega-kludge".
                    594: The presence of
                    595: .B yymore()
                    596: in the scanner's action entails a minor performance penalty in the
                    597: scanner's matching speed.
                    598: .IP -
                    599: .B yyless(n)
                    600: returns all but the first
                    601: .I n
                    602: characters of the current token back to the input stream, where they
                    603: will be rescanned when the scanner looks for the next match.
                    604: .B yytext
                    605: and
                    606: .B yyleng
                    607: are adjusted appropriately (e.g.,
                    608: .B yyleng
                    609: will now be equal to
                    610: .I n
                    611: ).  For example, on the input "foobar" the following will write out
                    612: "foobarbar":
                    613: .nf
                    614: 
                    615:     %%
                    616:     foobar    ECHO; yyless(3);
                    617:     [a-z]+    ECHO;
                    618: 
                    619: .fi
                    620: An argument of 0 to
                    621: .B yyless
                    622: will cause the entire current input string to be scanned again.  Unless you've
                    623: changed how the scanner will subsequently process its input (using
                    624: .B BEGIN,
                    625: for example), this will result in an endless loop.
                    626: .IP -
                    627: .B unput(c)
                    628: puts the character
                    629: .I c
                    630: back onto the input stream.  It will be the next character scanned.
                    631: The following action will take the current token and cause it
                    632: to be rescanned enclosed in parentheses.
                    633: .nf
                    634: 
                    635:     {
                    636:     int i;
                    637:     unput( ')' );
                    638:     for ( i = yyleng - 1; i >= 0; --i )
                    639:         unput( yytext[i] );
                    640:     unput( '(' );
                    641:     }
                    642: 
                    643: .fi
                    644: Note that since each
                    645: .B unput()
                    646: puts the given character back at the
                    647: .I beginning
                    648: of the input stream, pushing back strings must be done back-to-front.
                    649: .IP -
                    650: .B input()
                    651: reads the next character from the input stream.  For example,
                    652: the following is one way to eat up C comments:
                    653: .nf
                    654: 
                    655:     %%
                    656:     "/*"        {
                    657:                 register int c;
                    658: 
                    659:                 for ( ; ; )
                    660:                     {
                    661:                     while ( (c = input()) != '*' &&
                    662:                             c != EOF )
                    663:                         ;    /* eat up text of comment */
                    664: 
                    665:                     if ( c == '*' )
                    666:                         {
                    667:                         while ( (c = input()) == '*' )
                    668:                             ;
                    669:                         if ( c == '/' )
                    670:                             break;    /* found the end */
                    671:                         }
                    672: 
                    673:                     if ( c == EOF )
                    674:                         {
                    675:                         error( "EOF in comment" );
                    676:                         break;
                    677:                         }
                    678:                     }
                    679:                 }
                    680: 
                    681: .fi
                    682: (Note that if the scanner is compiled using
                    683: .B C++,
                    684: then
                    685: .B input()
                    686: is instead referred to as
                    687: .B yyinput(),
                    688: in order to avoid a name clash with the
                    689: .B C++
                    690: stream by the name of
                    691: .I input.)
                    692: .IP -
                    693: .B yyterminate()
                    694: can be used in lieu of a return statement in an action.  It terminates
                    695: the scanner and returns a 0 to the scanner's caller, indicating "all done".
                    696: Subsequent calls to the scanner will immediately return unless preceded
                    697: by a call to
                    698: .B yyrestart()
                    699: (see below).
                    700: By default,
                    701: .B yyterminate()
                    702: is also called when an end-of-file is encountered.  It is a macro and
                    703: may be redefined.
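The back-to-front rule for unput() described above can be tried out away from a scanner. The following is a minimal sketch, not flex code: the names push_char(), next_char(), push_string() and read_all() are hypothetical stand-ins that model the input stream as a small last-in, first-out store, which is the behavior the text attributes to unput().

```c
#include <stddef.h>
#include <string.h>

#define PUSHBACK_MAX 64

/* Hypothetical stand-in for the scanner's input: a LIFO pushback store. */
char pushback[PUSHBACK_MAX];
size_t pushback_len = 0;

/* Like unput(c): c becomes the next character to be read. */
int push_char(int c)
{
    if (pushback_len >= PUSHBACK_MAX)
        return -1;              /* no room left */
    pushback[pushback_len++] = (char)c;
    return 0;
}

/* Like input(): returns the next character, or -1 when nothing is left. */
int next_char(void)
{
    if (pushback_len == 0)
        return -1;
    return (unsigned char)pushback[--pushback_len];
}

/* Push a whole string back so that it is reread in its original order:
 * the last character is pushed first. */
int push_string(const char *s)
{
    size_t i = strlen(s);

    while (i > 0)
        if (push_char(s[--i]) != 0)
            return -1;
    return 0;
}

/* Drain the store into out (capacity cap, including the terminating NUL). */
void read_all(char *out, size_t cap)
{
    size_t n = 0;
    int c;

    while (n + 1 < cap && (c = next_char()) != -1)
        out[n++] = (char)c;
    out[n] = '\0';
}
```

Pushing ')' first, then the token text with push_string(), then '(' mirrors the order used in the unput() example, and reading the store back yields the token enclosed in parentheses.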
                    704: .SH THE GENERATED SCANNER
                    705: The output of
                    706: .I flex
                    707: is the file
                    708: .B lex.yy.c,
                    709: which contains the scanning routine
                    710: .B yylex(),
                    711: a number of tables used by it for matching tokens, and a number
                    712: of auxiliary routines and macros.  By default,
                    713: .B yylex()
                    714: is declared as follows:
                    715: .nf
                    716: 
                    717:     int yylex()
                    718:         {
                    719:         ... various definitions and the actions in here ...
                    720:         }
                    721: 
                    722: .fi
                    723: (If your environment supports function prototypes, then it will
                    724: be "int yylex( void )".)  This definition may be changed by redefining
                    725: the "YY_DECL" macro.  For example, you could use:
                    726: .nf
                    727: 
                    728:     #undef YY_DECL
                    729:     #define YY_DECL float lexscan( a, b ) float a, b;
                    730: 
                    731: .fi
                    732: to give the scanning routine the name
                    733: .I lexscan,
                    734: returning a float, and taking two floats as arguments.  Note that
                    735: if you give arguments to the scanning routine using a
                    736: K&R-style/non-prototyped function declaration, you must terminate
                    737: the definition with a semi-colon (;).
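For comparison, here is a sketch of the same redefinition written in prototype style, assuming a compiler that supports function prototypes. No trailing semi-colon is needed in this form. The function body below is only a placeholder so that the fragment compiles by itself; in lex.yy.c the generated scanner body would follow YY_DECL instead.

```c
/* Prototype-style counterpart of the K&R example; no trailing ';' needed. */
#define YY_DECL float lexscan(float a, float b)

/* Placeholder body (an assumption for illustration only).  In the
 * generated scanner, the scanning code appears here instead. */
YY_DECL
{
    return a + b;
}
```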
                    738: .LP
                    739: Whenever
                    740: .B yylex()
                    741: is called, it scans tokens from the global input file
                    742: .I yyin
                    743: (which defaults to stdin).  It continues until it either reaches
                    744: an end-of-file (at which point it returns the value 0) or
                    745: one of its actions executes a
                    746: .I return
                    747: statement.
                    748: In the former case, when called again the scanner will immediately
                    749: return unless
                    750: .B yyrestart()
                    751: is called to point
                    752: .I yyin
                    753: at the new input file.  (
                    754: .B yyrestart()
                    755: takes one argument, a
                    756: .B FILE *
                    757: pointer.)
                    758: In the latter case (i.e., when an action
                    759: executes a return), the scanner may then be called again and it
                    760: will resume scanning where it left off.
                    761: .LP
                    762: By default (and for purposes of efficiency), the scanner uses
                    763: block-reads rather than simple
                    764: .I getc()
                    765: calls to read characters from
                    766: .I yyin.
                    767: The nature of how it gets its input can be controlled by redefining the
                    768: .B YY_INPUT
                    769: macro.
                    770: YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)".  Its
                    771: action is to place up to
                    772: .I max_size
                    773: characters in the character array
                    774: .I buf
                    775: and return in the integer variable
                    776: .I result
                    777: either the
                    778: number of characters read or the constant YY_NULL (0 on Unix systems)
                    779: to indicate EOF.  The default YY_INPUT reads from the
                    780: global file-pointer "yyin".
                    781: .LP
                    782: A sample redefinition of YY_INPUT (in the definitions
                    783: section of the input file):
                    784: .nf
                    785: 
                    786:     %{
                    787:     #undef YY_INPUT
                    788:     #define YY_INPUT(buf,result,max_size) \\
                    789:         result = ((buf[0] = getchar()) == EOF) ? YY_NULL : 1;
                    790:     %}
                    791: 
                    792: .fi
                    793: This definition will change the input processing to occur
                    794: one character at a time.
                    795: .LP
                    796: You can also add things like input line-number tracking
                    797: this way, but don't expect your scanner to
                    798: go very fast.
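As a rough illustration of line tracking, the sketch below follows the YY_INPUT(buf,result,max_size) contract in an ordinary C function. The names read_counting_lines() and input_line_count are hypothetical and not part of flex. One plausible way to hook it in would be to define YY_INPUT so that it assigns the function's return value to result, passing yyin as the file.

```c
#include <stdio.h>

/* Hypothetical counter; not a flex variable. */
long input_line_count = 0;

/*
 * Follows the YY_INPUT contract described above: stores up to max_size
 * characters in buf and returns how many were stored, or 0 (the value
 * of YY_NULL on Unix systems) at end-of-file.  Each newline read bumps
 * the line counter.
 */
int read_counting_lines(FILE *in, char *buf, int max_size)
{
    int n = 0;
    int c;

    while (n < max_size && (c = getc(in)) != EOF) {
        buf[n++] = (char)c;
        if (c == '\n')
            ++input_line_count;
    }
    return n;
}
```

Note that the counter reflects what has been read into the buffer, which can run ahead of what the scanner has actually matched. Counting inside rule actions avoids that skew.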
                    799: .LP
                    800: When the scanner receives an end-of-file indication from YY_INPUT,
                    801: it then checks the
                    802: .B yywrap()
                    803: function.  If
                    804: .B yywrap()
                    805: returns false (zero), then it is assumed that the
                    806: function has gone ahead and set up
                    807: .I yyin
                    808: to point to another input file, and scanning continues.  If it returns
                    809: true (non-zero), then the scanner terminates, returning 0 to its
                    810: caller.
                    811: .LP
                    812: The default
                    813: .B yywrap()
                    814: always returns 1.  Presently, to redefine it you must first
                    815: "#undef yywrap", as it is currently implemented as a macro.  As indicated
                    816: by the hedging in the previous sentence, it may be changed to
                    817: a true function in the near future.
                    818: .LP
                    819: The scanner writes its
                    820: .B ECHO
                    821: output to the
                    822: .I yyout
                    823: global (default, stdout), which may be redefined by the user simply
                    824: by assigning it to some other
                    825: .B FILE
                    826: pointer.
                    827: .SH START CONDITIONS
                    828: .I flex
                    829: provides a mechanism for conditionally activating rules.  Any rule
                    830: whose pattern is prefixed with "<sc>" will only be active when
                    831: the scanner is in the start condition named "sc".  For example,
                    832: .nf
                    833: 
                    834:     <STRING>[^"]*        { /* eat up the string body ... */
                    835:                 ...
                    836:                 }
                    837: 
                    838: .fi
                    839: will be active only when the scanner is in the "STRING" start
                    840: condition, and
                    841: .nf
                    842: 
                    843:     <INITIAL,STRING,QUOTE>\\.        { /* handle an escape ... */
                    844:                 ...
                    845:                 }
                    846: 
                    847: .fi
                    848: will be active only when the current start condition is
                    849: either "INITIAL", "STRING", or "QUOTE".
                    850: .LP
                    851: Start conditions
                    852: are declared in the definitions (first) section of the input
                    853: using unindented lines beginning with either
                    854: .B %s
                    855: or
                    856: .B %x
                    857: followed by a list of names.
                    858: The former declares
                    859: .I inclusive
                    860: start conditions, the latter
                    861: .I exclusive
                    862: start conditions.  A start condition is activated using the
                    863: .B BEGIN
                    864: action.  Until the next
                    865: .B BEGIN
                    866: action is executed, rules with the given start
                    867: condition will be active and
                    868: rules with other start conditions will be inactive.
                    869: If the start condition is
                    870: .I inclusive,
                    871: then rules with no start conditions at all will also be active.
                    872: If it is
                    873: .I exclusive,
                    874: then
                    875: .I only
                    876: rules qualified with the start condition will be active.
                    877: A set of rules contingent on the same exclusive start condition
                    878: describes a scanner which is independent of any of the other rules in the
                    879: .I flex
                    880: input.  Because of this,
                    881: exclusive start conditions make it easy to specify "mini-scanners"
                    882: which scan portions of the input that are syntactically different
                    883: from the rest (e.g., comments).
                    884: .LP
                    885: If the distinction between inclusive and exclusive start conditions
                    886: is still a little vague, here's a simple example illustrating the
                    887: connection between the two.  The set of rules:
                    888: .nf
                    889: 
                    890:     %s example
                    891:     %%
                    892:     <example>foo           /* do something */
                    893: 
                    894: .fi
                    895: is equivalent to
                    896: .nf
                    897: 
                    898:     %x example
                    899:     %%
                    900:     <INITIAL,example>foo   /* do something */
                    901: 
                    902: .fi
                    903: .LP
                    904: The default rule (to
                    905: .B ECHO
                    906: any unmatched character) remains active in start conditions.
                    907: .LP
                    908: .B BEGIN(0)
                    909: returns to the original state where only the rules with
                    910: no start conditions are active.  This state can also be
                    911: referred to as the start-condition "INITIAL", so
                    912: .B BEGIN(INITIAL)
                    913: is equivalent to
                    914: .B BEGIN(0).
                    915: (The parentheses around the start condition name are not required but
                    916: are considered good style.)
                    917: .LP
                    918: .B BEGIN
                    919: actions can also be given as indented code at the beginning
                    920: of the rules section.  For example, the following will cause
                    921: the scanner to enter the "SPECIAL" start condition whenever
                    922: .I yylex()
                    923: is called and the global variable
                    924: .I enter_special
                    925: is true:
                    926: .nf
                    927: 
                    928:             int enter_special;
                    929: 
                    930:     %x SPECIAL
                    931:     %%
                    932:             if ( enter_special )
                    933:                 BEGIN(SPECIAL);
                    934: 
                    935:     <SPECIAL>blahblahblah
                    936:     ...more rules follow...
                    937: 
                    938: .fi
                    939: .LP
                    940: To illustrate the uses of start conditions,
                    941: here is a scanner which provides two different interpretations
                    942: of a string like "123.456".  By default it will treat it
                    943: as three tokens, the integer "123", a dot ('.'), and the integer "456".
                    944: But if the string is preceded earlier in the line by the string
                    945: "expect-floats"
                    946: it will treat it as a single token, the floating-point number
                    947: 123.456:
                    948: .nf
                    949: 
                    950:     %{
                    951:     #include <math.h>
                    952:     %}
                    953:     %s expect
                    954: 
                    955:     %%
                    956:     expect-floats        BEGIN(expect);
                    957: 
                    958:     <expect>[0-9]+"."[0-9]+      {
                    959:                 printf( "found a float, = %f\\n",
                    960:                         atof( yytext ) );
                    961:                 }
                    962:     <expect>\\n           {
                    963:                 /* that's the end of the line, so
                    964:                  * we need another "expect-floats"
                    965:                  * before we'll recognize any more
                    966:                  * numbers
                    967:                  */
                    968:                 BEGIN(INITIAL);
                    969:                 }
                    970: 
                    971:     [0-9]+      {
                    972:                 printf( "found an integer, = %d\\n",
                    973:                         atoi( yytext ) );
                    974:                 }
                    975: 
                    976:     "."         printf( "found a dot\\n" );
                    977: 
                    978: .fi
                    979: Here is a scanner which recognizes (and discards) C comments while
                    980: maintaining a count of the current input line.
                    981: .nf
                    982: 
                    983:     %x comment
                    984:     %%
                    985:             int line_num = 1;
                    986: 
                    987:     "/*"         BEGIN(comment);
                    988: 
                    989:     <comment>[^*\\n]*        /* eat anything that's not a '*' */
                    990:     <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
                    991:     <comment>\\n             ++line_num;
                    992:     <comment>"*"+"/"        BEGIN(INITIAL);
                    993: 
                    994: .fi
                    995: Note that start-condition names are really integer values and
                    996: can be stored as such.  Thus, the above could be extended in the
                    997: following fashion:
                    998: .nf
                    999: 
                   1000:     %x comment foo
                   1001:     %%
                   1002:             int line_num = 1;
                   1003:             int comment_caller;
                   1004: 
                   1005:     "/*"         {
                   1006:                  comment_caller = INITIAL;
                   1007:                  BEGIN(comment);
                   1008:                  }
                   1009: 
                   1010:     ...
                   1011: 
                   1012:     <foo>"/*"    {
                   1013:                  comment_caller = foo;
                   1014:                  BEGIN(comment);
                   1015:                  }
                   1016: 
                   1017:     <comment>[^*\\n]*        /* eat anything that's not a '*' */
                   1018:     <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
                   1019:     <comment>\\n             ++line_num;
                   1020:     <comment>"*"+"/"        BEGIN(comment_caller);
                   1021: 
                   1022: .fi
                   1023: One can then implement a "stack" of start conditions using an
                   1024: array of integers.  (It is likely that such stacks will become
                   1025: a full-fledged
                   1026: .I flex
                   1027: feature in the future.)  Note, though, that
                   1028: start conditions do not have their own name-space; %s's and %x's
                   1029: declare names in the same fashion as #define's.
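The array-of-integers stack mentioned above might look like the following sketch. The names sc_stack, sc_depth, sc_push() and sc_pop() are hypothetical helpers, not something flex provides. The code relies only on start conditions being integer values, with 0 standing for INITIAL as described earlier.

```c
#define SC_STACK_MAX 16

/* Hypothetical helper storage; flex itself provides no such stack here. */
int sc_stack[SC_STACK_MAX];
int sc_depth = 0;

/* Remember a start condition before switching away from it.
 * Returns 0 on success, -1 if the stack is full. */
int sc_push(int current)
{
    if (sc_depth >= SC_STACK_MAX)
        return -1;
    sc_stack[sc_depth++] = current;
    return 0;
}

/* Return the most recently saved start condition.  An empty stack
 * yields 0, which is the value of INITIAL. */
int sc_pop(void)
{
    if (sc_depth == 0)
        return 0;
    return sc_stack[--sc_depth];
}
```

An action for the "foo" rule could then call sc_push( foo ) before BEGIN(comment), and the rule that ends a comment could use BEGIN(sc_pop()) in place of the comment_caller variable.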
                   1030: .SH MULTIPLE INPUT BUFFERS
                   1031: Some scanners (such as those which support "include" files)
                   1032: require reading from several input streams.  As
                   1033: .I flex
                   1034: scanners do a large amount of buffering, one cannot control
                   1035: where the next input will be read from by simply writing a
                   1036: .B YY_INPUT
                   1037: which is sensitive to the scanning context.
                   1038: .B YY_INPUT
                   1039: is only called when the scanner reaches the end of its buffer, which
                   1040: may be a long time after scanning a statement such as an "include"
                   1041: which requires switching the input source.
                   1042: .LP
                   1043: To negotiate these sorts of problems,
                   1044: .I flex
                   1045: provides a mechanism for creating and switching between multiple
                   1046: input buffers.  An input buffer is created by using:
                   1047: .nf
                   1048: 
                   1049:     YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
                   1050: 
                   1051: .fi
                   1052: which takes a
                   1053: .I FILE
                   1054: pointer and a size and creates a buffer associated with the given
                   1055: file and large enough to hold
                   1056: .I size
                   1057: characters (when in doubt, use
                   1058: .B YY_BUF_SIZE
                   1059: for the size).  It returns a
                   1060: .B YY_BUFFER_STATE
                   1061: handle, which may then be passed to other routines:
                   1062: .nf
                   1063: 
                   1064:     void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
                   1065: 
                   1066: .fi
                   1067: switches the scanner's input buffer so subsequent tokens will
                   1068: come from
                   1069: .I new_buffer.
                   1070: Note that
                   1071: .B yy_switch_to_buffer()
                   1072: may be used by yywrap() to set things up for continued scanning, instead
                   1073: of opening a new file and pointing
                   1074: .I yyin
                   1075: at it.
                   1076: .nf
                   1077: 
                   1078:     void yy_delete_buffer( YY_BUFFER_STATE buffer )
                   1079: 
                   1080: .fi
                   1081: is used to reclaim the storage associated with a buffer.
                   1082: .LP
                   1083: .B yy_new_buffer()
                   1084: is an alias for
                   1085: .B yy_create_buffer(),
                   1086: provided for compatibility with the C++ use of
                   1087: .I new
                   1088: and
                   1089: .I delete
                   1090: for creating and destroying dynamic objects.
                   1091: .LP
                   1092: Finally, the
                   1093: .B YY_CURRENT_BUFFER
                   1094: macro returns a
                   1095: .B YY_BUFFER_STATE
                   1096: handle to the current buffer.
                   1097: .LP
                   1098: Here is an example of using these features for writing a scanner
                   1099: which expands include files (the
                   1100: .B <<EOF>>
                   1101: feature is discussed below):
                   1102: .nf
                   1103: 
                   1104:     /* the "incl" state is used for picking up the name
                   1105:      * of an include file
                   1106:      */
                   1107:     %x incl
                   1108: 
                   1109:     %{
                   1110:     #define MAX_INCLUDE_DEPTH 10
                   1111:     YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
                   1112:     int include_stack_ptr = 0;
                   1113:     %}
                   1114: 
                   1115:     %%
                   1116:     include             BEGIN(incl);
                   1117: 
                   1118:     [a-z]+              ECHO;
                   1119:     [^a-z\\n]*\\n?        ECHO;
                   1120: 
                   1121:     <incl>[ \\t]*      /* eat the whitespace */
                   1122:     <incl>[^ \\t\\n]+   { /* got the include file name */
                   1123:             if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                   1124:                 {
                   1125:                 fprintf( stderr, "Includes nested too deeply" );
                   1126:                 exit( 1 );
                   1127:                 }
                   1128: 
                   1129:             include_stack[include_stack_ptr++] =
                   1130:                 YY_CURRENT_BUFFER;
                   1131: 
                   1132:             yyin = fopen( yytext, "r" );
                   1133: 
                   1134:             if ( ! yyin )
                   1135:                 error( ... );
                   1136: 
                   1137:             yy_switch_to_buffer(
                   1138:                 yy_create_buffer( yyin, YY_BUF_SIZE ) );
                   1139: 
                   1140:             BEGIN(INITIAL);
                   1141:             }
                   1142: 
                   1143:     <<EOF>> {
                   1144:             if ( --include_stack_ptr < 0 )
                   1145:                 {
                   1146:                 yyterminate();
                   1147:                 }
                   1148: 
                   1149:             else
                   1150:                 yy_switch_to_buffer(
                   1151:                      include_stack[include_stack_ptr] );
                   1152:             }
                   1153: 
                   1154: .fi
                   1155: .SH END-OF-FILE RULES
                   1156: The special rule "<<EOF>>" indicates
                   1157: actions which are to be taken when an end-of-file is
                   1158: encountered and yywrap() returns non-zero (i.e., indicates
                   1159: no further files to process).  The action must finish
                   1160: by doing one of four things:
                   1161: .IP -
                   1162: the special
                   1163: .B YY_NEW_FILE
                   1164: action, if
                   1165: .I yyin
                   1166: has been pointed at a new file to process;
                   1167: .IP -
                   1168: a
                   1169: .I return
                   1170: statement;
                   1171: .IP -
                   1172: the special
                   1173: .B yyterminate()
                   1174: action;
                   1175: .IP -
                   1176: or, switching to a new buffer using
                   1177: .B yy_switch_to_buffer()
                   1178: as shown in the example above.
                   1179: .LP
                   1180: <<EOF>> rules may not be used with other
                   1181: patterns; they may only be qualified with a list of start
                   1182: conditions.  If an unqualified <<EOF>> rule is given, it
                   1183: applies to
                   1184: .I all
                   1185: start conditions which do not already have <<EOF>> actions.  To
                   1186: specify an <<EOF>> rule for only the initial start condition, use
                   1187: .nf
                   1188: 
                   1189:     <INITIAL><<EOF>>
                   1190: 
                   1191: .fi
                   1192: .LP
                   1193: These rules are useful for catching things like unclosed comments.
                   1194: An example:
                   1195: .nf
                   1196: 
                   1197:     %x quote
                   1198:     %%
                   1199: 
                   1200:     ...other rules for dealing with quotes...
                   1201: 
                   1202:     <quote><<EOF>>   {
                   1203:              error( "unterminated quote" );
                   1204:              yyterminate();
                   1205:              }
                   1206:     <<EOF>>  {
                   1207:              if ( *++filelist )
                   1208:                  {
                   1209:                  yyin = fopen( *filelist, "r" );
                   1210:                  YY_NEW_FILE;
                   1211:                  }
                   1212:              else
                   1213:                 yyterminate();
                   1214:              }
                   1215: 
                   1216: .fi
                   1217: .SH MISCELLANEOUS MACROS
                   1218: The macro
                   1219: .B
                   1220: YY_USER_ACTION
                   1221: can be redefined to provide an action
                   1222: which is always executed prior to the matched rule's action.  For example,
                   1223: it could be #define'd to call a routine to convert yytext to lower-case.
                   1224: .LP
                   1225: The macro
                   1226: .B YY_USER_INIT
                   1227: may be redefined to provide an action which is always executed before
                   1228: the first scan (and before the scanner's internal initializations are done).
                   1229: For example, it could be used to call a routine to read
                   1230: in a data table or open a logging file.
                   1231: .LP
                   1232: In the generated scanner, the actions are all gathered in one large
                   1233: switch statement and separated using
                   1234: .B YY_BREAK,
                   1235: which may be redefined.  By default, it is simply a "break", to separate
                   1236: each rule's action from the following rule's.
                   1237: Redefining
                   1238: .B YY_BREAK
                   1239: allows, for example, C++ users to
                   1240: #define YY_BREAK to do nothing (while being very careful that every
                   1241: rule ends with a "break" or a "return"!) to avoid
                   1242: unreachable-statement warnings in cases where a rule's action ends with
                   1243: "return" and the following
                   1244: .B YY_BREAK
                   1245: can therefore never be reached.
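.LP
A sketch of a scanner with an empty YY_BREAK might look like the
following.  TOK_NUMBER is assumed to be defined elsewhere, and the
final catch-all rule is there so that the default rule, whose action
would no longer be followed by a "break", is never reached:
.nf

    %{
    #define YY_BREAK    /* empty: each action must end its own case */
    %}

    %%

    [0-9]+        return TOK_NUMBER;
    [ \\t\\n]+      break;    /* explicit, YY_BREAK no longer supplies it */
    .             return yytext[0];

.fi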
                   1246: .SH INTERFACING WITH YACC
                   1247: One of the main uses of
                   1248: .I flex
                   1249: is as a companion to the
                   1250: .I yacc
                   1251: parser-generator.
                   1252: .I yacc
                   1253: parsers expect to call a routine named
                   1254: .B yylex()
                   1255: to find the next input token.  The routine is supposed to
                   1256: return the type of the next token and to put any associated
                   1257: value in the global
                   1258: .B yylval.
                   1259: To use
                   1260: .I flex
                   1261: with
                   1262: .I yacc,
                   1263: one specifies the
                   1264: .B -d
                   1265: option to
                   1266: .I yacc
                   1267: to instruct it to generate the file
                   1268: .B y.tab.h
                   1269: containing definitions of all the
                   1270: .B %tokens
                   1271: appearing in the
                   1272: .I yacc
                   1273: input.  This file is then included in the
                   1274: .I flex
                   1275: scanner.  For example, if one of the tokens is "TOK_NUMBER",
                   1276: part of the scanner might look like:
                   1277: .nf
                   1278: 
                   1279:     %{
                   1280:     #include "y.tab.h"
                   1281:     %}
                   1282: 
                   1283:     %%
                   1284: 
                   1285:     [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
                   1286: 
                   1287: .fi
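.LP
Building the pair might then look like the following sketch, in which
the file names parse.y and scan.l are illustrative and parse.y is
assumed to supply main() and yyerror():
.nf

    yacc -d parse.y
    flex scan.l
    cc -o parser y.tab.c lex.yy.c -lfl

.fi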
                   1288: .SH TRANSLATION TABLE
                   1289: In the name of POSIX compliance,
                   1290: .I flex
                   1291: supports a
                   1292: .I translation table
                   1293: for mapping input characters into groups.
                   1294: The table is specified in the first section, and its format looks like:
                   1295: .nf
                   1296: 
                   1297:     %t
                   1298:     1        abcd
                   1299:     2        ABCDEFGHIJKLMNOPQRSTUVWXYZ
                   1300:     52       0123456789
                   1301:     6        \\t\\ \\n
                   1302:     %t
                   1303: 
                   1304: .fi
                   1305: This example specifies that the characters 'a', 'b', 'c', and 'd'
                   1306: are to all be lumped into group #1, upper-case letters
                   1307: in group #2, digits in group #52, tabs, blanks, and newlines into
                   1308: group #6, and
                   1309: .I
                   1310: no other characters will appear in the patterns.
                   1311: The group numbers are actually disregarded by
                   1312: .I flex;
                   1313: .B %t
                   1314: serves, though, to lump characters together.  Given the above
                   1315: table, for example, the pattern "a(AA)*5" is equivalent to "d(ZQ)*0".
                   1316: They both say, "match any character in group #1, followed by
                   1317: zero-or-more pairs of characters
                   1318: from group #2, followed by a character from group #52."  Thus
                   1319: .B %t
                   1320: provides a crude way for introducing equivalence classes into
                   1321: the scanner specification.
                   1322: .LP
                   1323: Note that the
                   1324: .B -i
                   1325: option (see below) coupled with the equivalence classes which
                   1326: .I flex
                   1327: automatically generates take care of virtually all the instances
                   1328: when one might consider using
                   1329: .B %t.
                   1330: But what the hell, it's there if you want it.
                   1331: .SH OPTIONS
                   1332: .I flex
                   1333: has the following options:
                   1334: .TP
                   1335: .B -b
                   1336: Generate backtracking information to
                   1337: .I lex.backtrack.
                   1338: This is a list of scanner states which require backtracking
                   1339: and the input characters on which they do so.  By adding rules one
                   1340: can remove backtracking states.  If all backtracking states
                   1341: are eliminated and
                   1342: .B -f
                   1343: or
                   1344: .B -F
                   1345: is used, the generated scanner will run faster (see the
                   1346: .B -p
                   1347: flag).  Only users who wish to squeeze every last cycle out of their
                   1348: scanners need worry about this option.  (See the section on PERFORMANCE
                   1349: CONSIDERATIONS below.)
                   1350: .TP
                   1351: .B -c
                   1352: is a do-nothing, deprecated option included for POSIX compliance.
                   1353: .IP
                   1354: .B NOTE:
                   1355: in previous releases of
                   1356: .I flex
                   1357: .B -c
                   1358: specified table-compression options.  This functionality is
                   1359: now given by the
                   1360: .B -C
                   1361: flag.  To ease the impact of this change, when
                   1362: .I flex
                   1363: encounters
                   1364: .B -c,
                   1365: it currently issues a warning message and assumes that
                   1366: .B -C
                   1367: was desired instead.  In the future this "promotion" of
                   1368: .B -c
                   1369: to
                   1370: .B -C
                   1371: will go away in the name of full POSIX compliance (unless
                   1372: the POSIX meaning is removed first).
                   1373: .TP
                   1374: .B -d
                   1375: makes the generated scanner run in
                   1376: .I debug
                   1377: mode.  Whenever a pattern is recognized and the global
                   1378: .B yy_flex_debug
                   1379: is non-zero (which is the default),
                   1380: the scanner will write to
                   1381: .I stderr
                   1382: a line of the form:
                   1383: .nf
                   1384: 
                   1385:     --accepting rule at line 53 ("the matched text")
                   1386: 
                   1387: .fi
                   1388: The line number refers to the location of the rule in the file
                   1389: defining the scanner (i.e., the file that was fed to flex).  Messages
                   1390: are also generated when the scanner backtracks, accepts the
                   1391: default rule, reaches the end of its input buffer (or encounters
                   1392: a NUL; at this point, the two look the same as far as the scanner's concerned),
                   1393: or reaches an end-of-file.
                   1394: .TP
                   1395: .B -f
                   1396: specifies (take your pick)
                   1397: .I full table
                   1398: or
                   1399: .I fast scanner.
                   1400: No table compression is done.  The result is large but fast.
                   1401: This option is equivalent to
                   1402: .B -Cf
                   1403: (see below).
                   1404: .TP
                   1405: .B -i
                   1406: instructs
                   1407: .I flex
                   1408: to generate a
                   1409: .I case-insensitive
                   1410: scanner.  The case of letters given in the
                   1411: .I flex
                   1412: input patterns will
                   1413: be ignored, and tokens in the input will be matched regardless of case.  The
                   1414: matched text given in
                   1415: .I yytext
                   1416: will have the preserved case (i.e., it will not be folded).
                   1417: .TP
                   1418: .B -n
                   1419: is another do-nothing, deprecated option included only for
                   1420: POSIX compliance.
                   1421: .TP
                   1422: .B -p
                   1423: generates a performance report to stderr.  The report
                   1424: consists of comments regarding features of the
                   1425: .I flex
                   1426: input file which will cause a loss of performance in the resulting scanner.
                   1427: Note that the use of
                   1428: .I REJECT
                   1429: and variable trailing context (see the BUGS section in flex(1))
                   1430: entails a substantial performance penalty; use of
                   1431: .I yymore(),
                   1432: the
                   1433: .B ^
                   1434: operator,
                   1435: and the
                   1436: .B -I
                   1437: flag entail minor performance penalties.
                   1438: .TP
                   1439: .B -s
                   1440: causes the
                   1441: .I default rule
                   1442: (that unmatched scanner input is echoed to
                   1443: .I stdout)
                   1444: to be suppressed.  If the scanner encounters input that does not
                   1445: match any of its rules, it aborts with an error.  This option is
                   1446: useful for finding holes in a scanner's rule set.
                   1447: .TP
                   1448: .B -t
                   1449: instructs
                   1450: .I flex
                   1451: to write the scanner it generates to standard output instead
                   1452: of
                   1453: .B lex.yy.c.
                   1454: .TP
                   1455: .B -v
                   1456: specifies that
                   1457: .I flex
                   1458: should write to
                   1459: .I stderr
                   1460: a summary of statistics regarding the scanner it generates.
                   1461: Most of the statistics are meaningless to the casual
                   1462: .I flex
                   1463: user, but the
                   1464: first line identifies the version of
                   1465: .I flex,
                   1466: which is useful for figuring
                   1467: out where you stand with respect to patches and new releases,
                   1468: and the next two lines give the date when the scanner was created
                   1469: and a summary of the flags which were in effect.
                   1470: .TP
                   1471: .B -F
                   1472: specifies that the
                   1473: .ul
                   1474: fast
                   1475: scanner table representation should be used.  This representation is
                   1476: about as fast as the full table representation
                   1477: .ul
                   1478: (-f),
                   1479: and for some sets of patterns will be considerably smaller (and for
                   1480: others, larger).  In general, if the pattern set contains both "keywords"
                   1481: and a catch-all, "identifier" rule, such as in the set:
                   1482: .nf
                   1483: 
                   1484:     "case"    return TOK_CASE;
                   1485:     "switch"  return TOK_SWITCH;
                   1486:     ...
                   1487:     "default" return TOK_DEFAULT;
                   1488:     [a-z]+    return TOK_ID;
                   1489: 
                   1490: .fi
                   1491: then you're better off using the full table representation.  If only
                   1492: the "identifier" rule is present and you then use a hash table or some such
                   1493: to detect the keywords, you're better off using
                   1494: .ul
                   1495: -F.
                   1496: .IP
                   1497: This option is equivalent to
                   1498: .B -CF
                   1499: (see below).
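.IP
A sketch of the second arrangement follows.  keyword_token() is a
hypothetical helper, and a linear search over a small table stands in
for the hash table; the token names are those of the example above and
are assumed to come from y.tab.h:
.nf

    %{
    #include <string.h>
    #include "y.tab.h"

    /* hypothetical helper: map a name to its keyword token,
     * or to TOK_ID if it is not a keyword
     */
    static int keyword_token( const char *name )
        {
        static const struct { const char *name; int token; } kw[] =
            {
            { "case", TOK_CASE },
            { "switch", TOK_SWITCH },
            { "default", TOK_DEFAULT },
            };
        unsigned i;

        for ( i = 0; i < sizeof( kw ) / sizeof( kw[0] ); ++i )
            if ( strcmp( name, kw[i].name ) == 0 )
                return kw[i].token;

        return TOK_ID;
        }
    %}

    %%

    [a-z]+    return keyword_token( yytext );

.fi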
                   1500: .TP
                   1501: .B -I
                   1502: instructs
                   1503: .I flex
                   1504: to generate an
                   1505: .I interactive
                   1506: scanner.  Normally, scanners generated by
                   1507: .I flex
                   1508: always look ahead one
                   1509: character before deciding that a rule has been matched.  At the cost of
                   1510: some scanning overhead,
                   1511: .I flex
                   1512: will generate a scanner which only looks ahead
                   1513: when needed.  Such scanners are called
                   1514: .I interactive
                   1515: because if you want to write a scanner for an interactive system such as a
                   1516: command shell, you will probably want the user's input to be terminated
                   1517: with a newline, and without
                   1518: .B -I
                   1519: the user will have to type a character in addition to the newline in order
                   1520: to have the newline recognized.  This leads to dreadful interactive
                   1521: performance.
                   1522: .IP
                   1523: If all this seems too confusing, here's the general rule: if a human will
                   1524: be typing in input to your scanner, use
                   1525: .B -I,
                   1526: otherwise don't; if you don't care about squeezing the utmost performance
                   1527: from your scanner and you
                   1528: don't want to make any assumptions about the input to your scanner,
                   1529: use
                   1530: .B -I.
                   1531: .IP
                   1532: Note,
                   1533: .B -I
                   1534: cannot be used in conjunction with
                   1535: .I full
                   1536: or
                   1537: .I fast tables,
                   1538: i.e., the
                   1539: .B -f, -F, -Cf,
                   1540: or
                   1541: .B -CF
                   1542: flags.
                   1543: .TP
                   1544: .B -L
                   1545: instructs
                   1546: .I flex
                   1547: not to generate
                   1548: .B #line
                   1549: directives.  Without this option,
                   1550: .I flex
                   1551: peppers the generated scanner
                   1552: with #line directives so error messages in the actions will be correctly
                   1553: located with respect to the original
                   1554: .I flex
                   1555: input file, and not to
                   1556: the fairly meaningless line numbers of
                   1557: .B lex.yy.c.
                   1558: (Unfortunately
                   1559: .I flex
                   1560: does not presently generate the necessary directives
                   1561: to "retarget" the line numbers for those parts of
                   1562: .B lex.yy.c
                   1563: which it generated.  So if there is an error in the generated code,
                   1564: a meaningless line number is reported.)
                   1565: .TP
                   1566: .B -T
                   1567: makes
                   1568: .I flex
                   1569: run in
                   1570: .I trace
                   1571: mode.  It will generate a lot of messages to
                   1572: .I stdout
                   1573: concerning
                   1574: the form of the input and the resultant non-deterministic and deterministic
                   1575: finite automata.  This option is mostly for use in maintaining
                   1576: .I flex.
                   1577: .TP
                   1578: .B -8
                   1579: instructs
                   1580: .I flex
                   1581: to generate an 8-bit scanner, i.e., one which can recognize 8-bit
                   1582: characters.  On some sites,
                   1583: .I flex
                   1584: is installed with this option as the default.  On others, the default
                   1585: is 7-bit characters.  To see which is the case, check the verbose
                   1586: .B (-v)
                   1587: output for "equivalence classes created".  If the denominator of
                   1588: the number shown is 128, then by default
                   1589: .I flex
                   1590: is generating 7-bit scanners.  If it is 256, then the default is
                   1591: 8-bit characters and the
                   1592: .B -8
                   1593: flag is not required (but may be a good idea to keep the scanner
                   1594: specification portable).  Feeding a 7-bit scanner 8-bit characters
                   1595: will result in infinite loops, bus errors, or other such fireworks,
                   1596: so when in doubt, use the flag.  Note that if equivalence classes
                   1597: are used, 8-bit scanners take only slightly more table space than
                   1598: 7-bit scanners (128 bytes, to be exact); if equivalence classes are
                   1599: not used, however, then the tables may grow up to twice their
                   1600: 7-bit size.
                   1601: .TP 
                   1602: .B -C[efmF]
                   1603: controls the degree of table compression.
                   1604: .IP
                   1605: .B -Ce
                   1606: directs
                   1607: .I flex
                   1608: to construct
                   1609: .I equivalence classes,
                   1610: i.e., sets of characters
                   1611: which have identical lexical properties (for example, if the only
                   1612: appearance of digits in the
                   1613: .I flex
                   1614: input is in the character class
                   1615: "[0-9]" then the digits '0', '1', ..., '9' will all be put
                   1616: in the same equivalence class).  Equivalence classes usually give
                   1617: dramatic reductions in the final table/object file sizes (typically
                   1618: a factor of 2-5) and are pretty cheap performance-wise (one array
                   1619: look-up per character scanned).
                   1620: .IP
                   1621: .B -Cf
                   1622: specifies that the
                   1623: .I full
                   1624: scanner tables should be generated -
                   1625: .I flex
                   1626: should not compress the
                   1627: tables by taking advantage of similar transition functions for
                   1628: different states.
                   1629: .IP
                   1630: .B -CF
                   1631: specifies that the alternate fast scanner representation (described
                   1632: above under the
                   1633: .B -F
                   1634: flag)
                   1635: should be used.
                   1636: .IP
                   1637: .B -Cm
                   1638: directs
                   1639: .I flex
                   1640: to construct
                   1641: .I meta-equivalence classes,
                   1642: which are sets of equivalence classes (or characters, if equivalence
                   1643: classes are not being used) that are commonly used together.  Meta-equivalence
                   1644: classes are often a big win when using compressed tables, but they
                   1645: have a moderate performance impact (one or two "if" tests and one
                   1646: array look-up per character scanned).
                   1647: .IP
                   1648: A lone
                   1649: .B -C
                   1650: specifies that the scanner tables should be compressed but neither
                   1651: equivalence classes nor meta-equivalence classes should be used.
                   1652: .IP
                   1653: The options
                   1654: .B -Cf
                   1655: or
                   1656: .B -CF
                   1657: and
                   1658: .B -Cm
                   1659: do not make sense together - there is no opportunity for meta-equivalence
                   1660: classes if the table is not being compressed.  Otherwise the options
                   1661: may be freely mixed.
                   1662: .IP
                   1663: The default setting is
                   1664: .B -Cem,
                   1665: which specifies that
                   1666: .I flex
                   1667: should generate equivalence classes
                   1668: and meta-equivalence classes.  This setting provides the highest
                   1669: degree of table compression.  You can obtain
                   1670: faster-executing scanners at the cost of larger tables, with
                   1671: the following generally being true:
                   1672: .nf
                   1673: 
                   1674:     slowest & smallest
                   1675:           -Cem
                   1676:           -Cm
                   1677:           -Ce
                   1678:           -C
                   1679:           -C{f,F}e
                   1680:           -C{f,F}
                   1681:     fastest & largest
                   1682: 
                   1683: .fi
                   1684: Note that scanners with the smallest tables are usually generated and
                   1685: compiled the quickest, so
                   1686: during development you will usually want to use the default, maximal
                   1687: compression.
                   1688: .IP
                   1689: .B -Cfe
                   1690: is often a good compromise between speed and size for production
                   1691: scanners.
                   1692: .IP
                   1693: .B -C
                   1694: options are not cumulative; whenever the flag is encountered, the
                   1695: previous -C settings are forgotten.
                   1696: .TP
                   1697: .B -Sskeleton_file
                   1698: overrides the default skeleton file from which
                   1699: .I flex
                   1700: constructs its scanners.  You'll never need this option unless you are doing
                   1701: .I flex
                   1702: maintenance or development.
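.LP
As a rough illustration of how these flags combine (scan.l is an
illustrative file name):
.nf

    flex -v scan.l           default -Cem tables, plus statistics
    flex -I scan.l           interactive scanner, compressed tables
    flex -Cfe -8 scan.l      full tables with equivalence classes, 8-bit
    flex -b -p -f scan.l     full tables; also writes lex.backtrack
                             and a performance report

.fi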
                   1703: .SH PERFORMANCE CONSIDERATIONS
                   1704: The main design goal of
                   1705: .I flex
                   1706: is that it generate high-performance scanners.  It has been optimized
                   1707: for dealing well with large sets of rules.  Aside from the effects
                   1708: of table compression on scanner speed outlined above,
                   1709: there are a number of options/actions which degrade performance.  These
                   1710: are, from most expensive to least:
                   1711: .nf
                   1712: 
                   1713:     REJECT
                   1714: 
                   1715:     pattern sets that require backtracking
                   1716:     arbitrary trailing context
                   1717: 
                   1718:     '^' beginning-of-line operator
                   1719:     yymore()
                   1720: 
                   1721: .fi
                   1722: with the first three all being quite expensive and the last two
                   1723: being quite cheap.
                   1724: .LP
                   1725: .B REJECT
                   1726: should be avoided at all costs when performance is important.
                   1727: It is a particularly expensive option.
                   1728: .LP
                   1729: Getting rid of backtracking is messy and often may be an enormous
                   1730: amount of work for a complicated scanner.  In principle, one begins
                   1731: by using the
                   1732: .B -b 
                   1733: flag to generate a
                   1734: .I lex.backtrack
                   1735: file.  For example, on the input
                   1736: .nf
                   1737: 
                   1738:     %%
                   1739:     foo        return TOK_KEYWORD;
                   1740:     foobar     return TOK_KEYWORD;
                   1741: 
                   1742: .fi
                   1743: the file looks like:
                   1744: .nf
                   1745: 
                   1746:     State #6 is non-accepting -
                   1747:      associated rule line numbers:
                   1748:            2       3
                   1749:      out-transitions: [ o ]
                   1750:      jam-transitions: EOF [ \\001-n  p-\\177 ]
                   1751: 
                   1752:     State #8 is non-accepting -
                   1753:      associated rule line numbers:
                   1754:            3
                   1755:      out-transitions: [ a ]
                   1756:      jam-transitions: EOF [ \\001-`  b-\\177 ]
                   1757: 
                   1758:     State #9 is non-accepting -
                   1759:      associated rule line numbers:
                   1760:            3
                   1761:      out-transitions: [ r ]
                   1762:      jam-transitions: EOF [ \\001-q  s-\\177 ]
                   1763: 
                   1764:     Compressed tables always backtrack.
                   1765: 
                   1766: .fi
                   1767: The first few lines tell us that there's a scanner state in
                   1768: which it can make a transition on an 'o' but not on any other
                   1769: character, and that in that state the currently scanned text does not match
                   1770: any rule.  The state occurs when trying to match the rules found
                   1771: at lines 2 and 3 in the input file.
                   1772: If the scanner is in that state and then reads
                   1773: something other than an 'o', it will have to backtrack to find
                   1774: a rule which is matched.  With
                   1775: a bit of headscratching one can see that this must be the
                   1776: state it's in when it has seen "fo".  When this has happened,
                   1777: if anything other than another 'o' is seen, the scanner will
                   1778: have to back up to simply match the 'f' (by the default rule).
                   1779: .LP
                   1780: The comment regarding State #8 indicates there's a problem
                   1781: when "foob" has been scanned.  Indeed, on any character other
                   1782: than an 'a', the scanner will have to back up to accept "foo".
                   1783: Similarly, the comment for State #9 concerns when "fooba" has
                   1784: been scanned.
                   1785: .LP
                   1786: The final comment reminds us that there's no point going to
                   1787: all the trouble of removing backtracking from the rules unless
                   1788: we're using
                   1789: .B -f
                   1790: or
                   1791: .B -F,
                   1792: since there's no performance gain doing so with compressed scanners.
                   1793: .LP
                   1794: The way to remove the backtracking is to add "error" rules:
                   1795: .nf
                   1796: 
                   1797:     %%
                   1798:     foo         return TOK_KEYWORD;
                   1799:     foobar      return TOK_KEYWORD;
                   1800: 
                   1801:     fooba       |
                   1802:     foob        |
                   1803:     fo          {
                   1804:                 /* false alarm, not really a keyword */
                   1805:                 return TOK_ID;
                   1806:                 }
                   1807: 
                   1808: .fi
                   1809: .LP
                   1810: Eliminating backtracking among a list of keywords can also be
                   1811: done using a "catch-all" rule:
                   1812: .nf
                   1813: 
                   1814:     %%
                   1815:     foo         return TOK_KEYWORD;
                   1816:     foobar      return TOK_KEYWORD;
                   1817: 
                   1818:     [a-z]+      return TOK_ID;
                   1819: 
                   1820: .fi
                   1821: This is usually the best solution when appropriate.
                   1822: .LP
                   1823: Backtracking messages tend to cascade.
                   1824: With a complicated set of rules it's not uncommon to get hundreds
                   1825: of messages.  If one can decipher them, though, it often
                   1826: only takes a dozen or so rules to eliminate the backtracking (though
                   1827: it's easy to make a mistake and have an error rule accidentally match
                   1828: a valid token).  A possible future
                   1829: .I flex
                   1830: feature will be to automatically add rules to eliminate backtracking.
                   1831: .LP
                   1832: .I Variable
                   1833: trailing context (where neither the leading nor the trailing part has
                   1834: a fixed length) entails almost the same performance loss as
                   1835: .I REJECT
                   1836: (i.e., substantial).  So when possible a rule like:
                   1837: .nf
                   1838: 
                   1839:     %%
                   1840:     mouse|rat/(cat|dog)   run();
                   1841: 
                   1842: .fi
                   1843: is better written:
                   1844: .nf
                   1845: 
                   1846:     %%
                   1847:     mouse/cat|dog         run();
                   1848:     rat/cat|dog           run();
                   1849: 
                   1850: .fi
                   1851: or as
                   1852: .nf
                   1853: 
                   1854:     %%
                   1855:     mouse|rat/cat         run();
                   1856:     mouse|rat/dog         run();
                   1857: 
                   1858: .fi
                   1859: Note that here the special '|' action does
                   1860: .I not
                   1861: provide any savings, and can even make things worse (see
                   1862: .B BUGS
                   1863: in flex(1)).
                   1864: .LP
                   1865: Another area where the user can increase a scanner's performance
                   1866: (and one that's easier to implement) arises from the fact that
                   1867: the longer the tokens matched, the faster the scanner will run.
                   1868: This is because with long tokens the processing of most input
                   1869: characters takes place in the (short) inner scanning loop, and
                   1870: does not often have to go through the additional work of setting up
                   1871: the scanning environment (e.g.,
                   1872: .B yytext)
                   1873: for the action.  Recall the scanner for C comments:
                   1874: .nf
                   1875: 
                   1876:     %x comment
                   1877:     %%
                   1878:             int line_num = 1;
                   1879: 
                   1880:     "/*"         BEGIN(comment);
                   1881: 
                   1882:     <comment>[^*\\n]*
                   1883:     <comment>"*"+[^*/\\n]*
                   1884:     <comment>\\n             ++line_num;
                   1885:     <comment>"*"+"/"        BEGIN(INITIAL);
                   1886: 
                   1887: .fi
                   1888: This could be sped up by writing it as:
                   1889: .nf
                   1890: 
                   1891:     %x comment
                   1892:     %%
                   1893:             int line_num = 1;
                   1894: 
                   1895:     "/*"         BEGIN(comment);
                   1896: 
                   1897:     <comment>[^*\\n]*
                   1898:     <comment>[^*\\n]*\\n      ++line_num;
                   1899:     <comment>"*"+[^*/\\n]*
                   1900:     <comment>"*"+[^*/\\n]*\\n ++line_num;
                   1901:     <comment>"*"+"/"        BEGIN(INITIAL);
                   1902: 
                   1903: .fi
                   1904: Now instead of each newline requiring the processing of another
                   1905: action, recognizing the newlines is "distributed" over the other rules
                   1906: to keep the matched text as long as possible.  Note that
                   1907: .I adding
                   1908: rules does
                   1909: .I not
                   1910: slow down the scanner!  The speed of the scanner is independent
                   1911: of the number of rules or (modulo the considerations given at the
                   1912: beginning of this section) how complicated the rules are with
                   1913: regard to operators such as '*' and '|'.
                   1914: .LP
                   1915: A final example in speeding up a scanner: suppose you want to scan
                   1916: through a file containing identifiers and keywords, one per line
                   1917: and with no other extraneous characters, and recognize all the
                   1918: keywords.  A natural first approach is:
                   1919: .nf
                   1920: 
                   1921:     %%
                   1922:     asm      |
                   1923:     auto     |
                   1924:     break    |
                   1925:     ... etc ...
                   1926:     volatile |
                   1927:     while    /* it's a keyword */
                   1928: 
                   1929:     .|\\n     /* it's not a keyword */
                   1930: 
                   1931: .fi
                   1932: To eliminate the backtracking, introduce a catch-all rule:
                   1933: .nf
                   1934: 
                   1935:     %%
                   1936:     asm      |
                   1937:     auto     |
                   1938:     break    |
                   1939:     ... etc ...
                   1940:     volatile |
                   1941:     while    /* it's a keyword */
                   1942: 
                   1943:     [a-z]+   |
                   1944:     .|\\n     /* it's not a keyword */
                   1945: 
                   1946: .fi
                   1947: Now, if it's guaranteed that there's exactly one word per line,
                   1948: then we can reduce the total number of matches by a half by
                   1949: merging in the recognition of newlines with that of the other
                   1950: tokens:
                   1951: .nf
                   1952: 
                   1953:     %%
                   1954:     asm\\n    |
                   1955:     auto\\n   |
                   1956:     break\\n  |
                   1957:     ... etc ...
                   1958:     volatile\\n |
                   1959:     while\\n  /* it's a keyword */
                   1960: 
                   1961:     [a-z]+\\n |
                   1962:     .|\\n     /* it's not a keyword */
                   1963: 
                   1964: .fi
                   1965: One has to be careful here, as we have now reintroduced backtracking
                   1966: into the scanner.  In particular, while
                   1967: .I we
                   1968: know that there will never be any characters in the input stream
                   1969: other than letters or newlines,
                   1970: .I flex
                   1971: can't figure this out, and it will plan for possibly needing backtracking
                   1972: when it has scanned a token like "auto" and then the next character
                   1973: is something other than a newline or a letter.  Previously it would
                   1974: then just match the "auto" rule and be done, but now it has no "auto"
                   1975: rule, only an "auto\\n" rule.  To eliminate the possibility of backtracking,
                   1976: we could either duplicate all rules but without final newlines, or,
                   1977: since we never expect to encounter such an input and therefore don't care
                   1978: how it's classified, we can introduce one more catch-all rule, this
                   1979: one which doesn't include a newline:
                   1980: .nf
                   1981: 
                   1982:     %%
                   1983:     asm\\n    |
                   1984:     auto\\n   |
                   1985:     break\\n  |
                   1986:     ... etc ...
                   1987:     volatile\\n |
                   1988:     while\\n  /* it's a keyword */
                   1989: 
                   1990:     [a-z]+\\n |
                   1991:     [a-z]+   |
                   1992:     .|\\n     /* it's not a keyword */
                   1993: 
                   1994: .fi
                   1995: Compiled with
                   1996: .B -Cf,
                   1997: this is about as fast as one can get a
                   1998: .I flex 
                   1999: scanner to go for this particular problem.
                   2000: .LP
                   2001: A final note:
                   2002: .I flex
                   2003: is slow when matching NUL's, particularly when a token contains
                   2004: multiple NUL's.
                   2005: It's best to write rules which match
                   2006: .I short
                   2007: amounts of text if it's anticipated that the text will often include NUL's.
                   2008: .SH INCOMPATIBILITIES WITH LEX AND POSIX
                   2009: .I flex
                   2010: is a rewrite of the Unix
                   2011: .I lex
                   2012: tool (the two implementations do not share any code, though),
                   2013: with some extensions and incompatibilities, both of which
                   2014: are of concern to those who wish to write scanners acceptable
                   2015: to either implementation.  At present, the POSIX
                   2016: .I lex
                   2017: draft is
                   2018: very close to the original
                   2019: .I lex
                   2020: implementation, so some of these
                   2021: incompatibilities are also in conflict with the POSIX draft.  But
                   2022: the intent is that except as noted below,
                   2023: .I flex
                   2024: as it presently stands will
                   2025: ultimately be POSIX conformant (i.e., that those areas of conflict with
                   2026: the POSIX draft will be resolved in
                   2027: .I flex's
                   2028: favor).  Please bear in
                   2029: mind that all the comments which follow are with regard to the POSIX
                   2030: .I draft
                   2031: standard of Summer 1989, and not the final document (or subsequent
                   2032: drafts); they are included so
                   2033: .I flex
                   2034: users can be aware of the standardization issues and those areas where
                   2035: .I flex
                   2036: may in the near future undergo changes incompatible with
                   2037: its current definition.
                   2038: .LP
                   2039: .I flex
                   2040: is fully compatible with
                   2041: .I lex
                   2042: with the following exceptions:
                   2043: .IP -
                   2044: .I lex
                   2045: does not support exclusive start conditions (%x), though they
                   2046: are in the current POSIX draft.
                   2047: .IP -
                   2048: When definitions are expanded,
                   2049: .I flex
                   2050: encloses them in parentheses.
                   2051: With lex, the following:
                   2052: .nf
                   2053: 
                   2054:     NAME    [A-Z][A-Z0-9]*
                   2055:     %%
                   2056:     foo{NAME}?      printf( "Found it\\n" );
                   2057:     %%
                   2058: 
                   2059: .fi
                   2060: will not match the string "foo" because when the macro
                   2061: is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
                   2062: and the precedence is such that the '?' is associated with
                   2063: "[A-Z0-9]*".  With
                   2064: .I flex,
                   2065: the rule will be expanded to
                   2066: "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
                   2067: Note that because of this, the
                   2068: .B ^, $, <s>, /,
                   2069: and
                   2070: .B <<EOF>>
                   2071: operators cannot be used in a
                   2072: .I flex
                   2073: definition.
                   2074: .IP
                   2075: The POSIX draft interpretation is the same as
                   2076: .I flex's.
                   2077: .IP -
                   2078: To specify a character class which matches anything but a right bracket (']'),
                   2079: in
                   2080: .I lex
                   2081: one can use "[^]]" but with
                   2082: .I flex
                   2083: one must use "[^\\]]".  The latter works with
                   2084: .I lex,
                   2085: too.
                   2086: .IP -
                   2087: The undocumented
                   2088: .I lex
                   2089: scanner internal variable
                   2090: .B yylineno
                   2091: is not supported.  (The variable is not part of the POSIX draft.)
                   2092: .IP -
                   2093: The
                   2094: .B input()
                   2095: routine is not redefinable, though it may be called to read characters
                   2096: following whatever has been matched by a rule.  If
                   2097: .B input()
                   2098: encounters an end-of-file the normal
                   2099: .B yywrap()
                   2100: processing is done.  A ``real'' end-of-file is returned by
                   2101: .B input()
                   2102: as
                   2103: .I EOF.
                   2104: .IP
                   2105: Input is instead controlled by redefining the
                   2106: .B YY_INPUT
                   2107: macro.
                   2108: .IP
                   2109: The
                   2110: .I flex
                   2111: restriction that
                   2112: .B input()
                   2113: cannot be redefined is in accordance with the POSIX draft, but
                   2114: .B YY_INPUT
                   2115: has not yet been accepted into the draft.
                   2116: .IP -
                   2117: .B output()
                   2118: is not supported.
                   2119: Output from the
                   2120: .B ECHO
                   2121: macro is done to the file-pointer
                   2122: .I yyout
                   2123: (default
                   2124: .I stdout).
                   2125: .IP
                   2126: The POSIX draft mentions that an
                   2127: .B output()
                   2128: routine exists but currently gives no details as to what it does.
                   2129: .IP -
                   2130: The
                   2131: .I lex
                   2132: .B %r
                   2133: (generate a Ratfor scanner) option is not supported.  It is not part
                   2134: of the POSIX draft.
                   2135: .IP -
                   2136: If you are providing your own yywrap() routine, you must include a
                   2137: "#undef yywrap" in the definitions section (section 1).  Note that
                   2138: the "#undef" will have to be enclosed in %{}'s.
                   2139: .IP
                   2140: The POSIX draft
                   2141: specifies that yywrap() is a function and this is unlikely to change; so
                   2142: .I flex
                   2143: users are warned that
                   2144: .B yywrap()
                   2145: is likely to be changed to a function in the near future.
                   2146: .IP -
                   2147: After a call to
                   2148: .B unput(),
                   2149: .I yytext
                   2150: and
                   2151: .I yyleng
                   2152: are undefined until the next token is matched.  This is not the case with
                   2153: .I lex
                   2154: or the present POSIX draft.
                   2155: .IP -
                   2156: The precedence of the
                   2157: .B {}
                   2158: (numeric range) operator is different.
                   2159: .I lex
                   2160: interprets "abc{1,3}" as "match one, two, or
                   2161: three occurrences of 'abc'", whereas
                   2162: .I flex
                   2163: interprets it as "match 'ab'
                   2164: followed by one, two, or three occurrences of 'c'".  The latter is
                   2165: in agreement with the current POSIX draft.
                   2166: .IP -
                   2167: The precedence of the
                   2168: .B ^
                   2169: operator is different.
                   2170: .I lex
                   2171: interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
                   2172: or 'bar' anywhere", whereas
                   2173: .I flex
                   2174: interprets it as "match either 'foo' or 'bar' if they come at the beginning
                   2175: of a line".  The latter is in agreement with the current POSIX draft.
                   2176: .IP -
                   2177: To refer to yytext outside of the scanner source file,
                   2178: the correct definition with
                   2179: .I flex
                   2180: is "extern char *yytext" rather than "extern char yytext[]".
                   2181: This is contrary to the current POSIX draft but a point on which
                   2182: .I flex
                   2183: will not be changing, as the array representation entails a
                   2184: serious performance penalty.  It is hoped that the POSIX draft will
                   2185: be emended to support the
                   2186: .I flex
                   2187: variety of declaration (as this is a fairly painless change to
                   2188: require of
                   2189: .I lex
                   2190: users).
                   2191: .IP -
                   2192: .I yyin
                   2193: is
                   2194: .I initialized
                   2195: by
                   2196: .I lex
                   2197: to be
                   2198: .I stdin;
                   2199: .I flex,
                   2200: on the other hand,
                   2201: initializes
                   2202: .I yyin
                   2203: to NULL
                   2204: and then
                   2205: .I assigns
                   2206: it to
                   2207: .I stdin
                   2208: the first time the scanner is called, providing
                   2209: .I yyin
                   2210: has not already been assigned to a non-NULL value.  The difference is
                   2211: subtle, but the net effect is that with
                   2212: .I flex
                   2213: scanners,
                   2214: .I yyin
                   2215: does not have a valid value until the scanner has been called.
                   2216: .IP -
                   2217: The special table-size declarations such as
                   2218: .B %a
                   2219: supported by
                   2220: .I lex
                   2221: are not required by
                   2222: .I flex
                   2223: scanners;
                   2224: .I flex
                   2225: ignores them.
                   2226: .IP -
                   2227: The name
                   2228: .B
                   2229: FLEX_SCANNER
                   2230: is #define'd so scanners may be written for use with either
                   2231: .I flex
                   2232: or
                   2233: .I lex.
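                         .IP
                         As a minimal sketch of how this might be used, the definitions section
                         below tests
                         .B FLEX_SCANNER
                         to paper over two of the differences listed above.  The names
                         .B COUNT_LINE, CUR_LINE,
                         and
                         .B line_num
                         are hypothetical, and the
                         .I lex
                         branch assumes an implementation that maintains
                         .BR yylineno .
                         .B CUR_LINE
                         is meant for use in diagnostics elsewhere in the scanner.
                         .nf

                             %{
                             #ifdef FLEX_SCANNER
                             #undef yywrap   /* flex: needed before defining our own */
                             static int line_num = 1;
                             #define COUNT_LINE  (++line_num)
                             #define CUR_LINE    line_num
                             #else
                             #define COUNT_LINE  /* lex updates yylineno itself */
                             #define CUR_LINE    yylineno
                             #endif
                             %}
                             %%
                             \\n      COUNT_LINE;
                             .       ;
                             %%
                             int yywrap() { return 1; }

                         .fi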
                   2234: .LP
                   2235: The following
                   2236: .I flex
                   2237: features are not included in
                   2238: .I lex
                   2239: or the POSIX draft standard:
                   2240: .nf
                   2241: 
                   2242:     yyterminate()
                   2243:     <<EOF>>
                   2244:     YY_DECL
                   2245:     #line directives
                   2246:     %{}'s around actions
                   2247:     yyrestart()
                   2248:     comments beginning with '#' (deprecated)
                   2249:     multiple actions on a line
                   2250: 
                   2251: .fi
                   2252: This last feature refers to the fact that with
                   2253: .I flex
                   2254: you can put multiple actions on the same line, separated with
                   2255: semi-colons, while with
                   2256: .I lex,
                   2257: the following
                   2258: .nf
                   2259: 
                   2260:     foo    handle_foo(); ++num_foos_seen;
                   2261: 
                   2262: .fi
                   2263: is (rather surprisingly) truncated to
                   2264: .nf
                   2265: 
                   2266:     foo    handle_foo();
                   2267: 
                   2268: .fi
                   2269: .I flex
                   2270: does not truncate the action.  Actions that are not enclosed in
                   2271: braces are simply terminated at the end of the line.
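                         .LP
                         A form that should behave the same way under both tools, offered here
                         as a suggestion rather than as something either implementation requires,
                         is to enclose the statements in braces:
                         .nf

                             foo    { handle_foo(); ++num_foos_seen; }

                         .fi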
                   2272: .SH DIAGNOSTICS
                   2273: .I reject_used_but_not_detected undefined
                   2274: or
                   2275: .I yymore_used_but_not_detected undefined -
                   2276: These errors can occur at compile time.  They indicate that the
                   2277: scanner uses
                   2278: .B REJECT
                   2279: or
                   2280: .B yymore()
                   2281: but that
                   2282: .I flex
                   2283: failed to notice the fact, meaning that
                   2284: .I flex
                   2285: scanned the first two sections looking for occurrences of these actions
                   2286: and failed to find any, but somehow you snuck some in (via a #include
                   2287: file, for example).  Make an explicit reference to the action in your
                   2288: .I flex
                   2289: input file.  (Note that previously
                   2290: .I flex
                   2291: supported a
                   2292: .B %used/%unused
                   2293: mechanism for dealing with this problem; this feature is still supported
                   2294: but now deprecated, and will go away soon unless the author hears from
                   2295: people who can argue compellingly that they need it.)
                   2296: .LP
                   2297: .I flex scanner jammed -
                   2298: a scanner compiled with
                   2299: .B -s
                   2300: has encountered an input string which wasn't matched by
                   2301: any of its rules.
                   2302: .LP
                   2303: .I flex input buffer overflowed -
                   2304: a scanner rule matched a string long enough to overflow the
                   2305: scanner's internal input buffer (16K bytes by default - controlled by
                   2306: .B YY_BUF_SIZE
                   2307: in "flex.skel".  Note that to redefine this macro, you must first
                   2308: .B #undef
                   2309: it).
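                         .LP
                         For example, a sketch of doubling the default size might look like the
                         following.  It assumes the redefinition is placed in the definitions
                         section and that this section is copied into the scanner after the
                         skeleton's own definition (which the need for the #undef suggests);
                         the value 32768 is only illustrative.
                         .nf

                             %{
                             #undef YY_BUF_SIZE
                             #define YY_BUF_SIZE 32768
                             %}

                         .fi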
                   2310: .LP
                   2311: .I scanner requires -8 flag -
                   2312: Your scanner specification includes recognizing 8-bit characters and
                   2313: you did not specify the -8 flag (and your site has not installed flex
                   2314: with -8 as the default).
                   2315: .LP
                   2316: .I too many %t classes! -
                   2317: You managed to put every single character into its own %t class.
                   2318: .I flex
                   2319: requires that at least one of the classes share characters.
                   2320: .SH DEFICIENCIES / BUGS
                   2321: See flex(1).
                   2322: .SH "SEE ALSO"
                   2323: .LP
                   2324: flex(1), lex(1), yacc(1), sed(1), awk(1).
                   2325: .LP
                   2326: M. E. Lesk and E. Schmidt,
                   2327: .I LEX - Lexical Analyzer Generator
                   2328: .SH AUTHOR
                   2329: Vern Paxson, with the help of many ideas and much inspiration from
                   2330: Van Jacobson.  Original version by Jef Poskanzer.  The fast table
                   2331: representation is a partial implementation of a design done by Van
                   2332: Jacobson.  The implementation was done by Kevin Gong and Vern Paxson.
                   2333: .LP
                   2334: Thanks to the many
                   2335: .I flex
                   2336: beta-testers, feedbackers, and contributors, especially Casey
                   2337: Leedom, [email protected],
                   2338: Frederic Brehm, Nick Christopher, Jason Coughlin,
                   2339: Scott David Daniels, Leo Eskin,
                   2340: Chris Faylor, Eric Goldman, Eric
                   2341: Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht,
                   2342: Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt,
                   2343: Jef Poskanzer, Jim Roskind,
                   2344: Dave Tallman, Frank Whaley, Ken Yap, and those whose names
                   2345: have slipped my marginal mail-archiving skills but whose contributions
                   2346: are appreciated all the same.
                   2347: .LP
                   2348: Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob
                   2349: Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
                   2350: headaches.
                   2351: .LP
                   2352: Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
                   2353: to Benson Margulies and Fred
                   2354: Burke for C++ support; to Ove Ewerlid for the basics of support for
                   2355: NUL's; and to Eric Hughes for the basics of support for multiple buffers.
                   2356: .LP
                   2357: Work is being done on extending
                   2358: .I flex
                   2359: to generate scanners in which the
                   2360: state machine is directly represented in C code rather than tables.
                   2361: These scanners may well be substantially faster than those generated
                   2362: using -f or -F.  If you are working in this area and are interested
                   2363: in comparing notes and seeing whether redundant work can be avoided,
                   2364: contact Ove Ewerlid ([email protected]).
                   2365: .LP
                   2366: This work was primarily done when I was at the Real Time Systems Group
                   2367: at the Lawrence Berkeley Laboratory in Berkeley, CA.  Many thanks to all there
                   2368: for the support I received.
                   2369: .LP
                   2370: Send comments to:
                   2371: .nf
                   2372: 
                   2373:      Vern Paxson
                   2374:      Computer Science Department
                   2375:      4126 Upson Hall
                   2376:      Cornell University
                   2377:      Ithaca, NY 14853-7501
                   2378: 
                   2379:      [email protected]
                   2380:      decvax!cornell!vern
                   2381: 
                   2382: .fi