.TH FLEX 1 "26 May 1990" "Version 2.3"
.SH NAME
flex - fast lexical analyzer generator
.SH SYNOPSIS
.B flex
.B [-bcdfinpstvFILT8 -C[efmF] -Sskeleton]
.I [filename ...]
.SH DESCRIPTION
.I flex
is a tool for generating
.I scanners:
programs which recognize lexical patterns in text.
.I flex
reads
the given input files, or its standard input if no file names are given,
for a description of a scanner to generate. The description is in
the form of pairs
of regular expressions and C code, called
.I rules. flex
generates as output a C source file,
.B lex.yy.c,
which defines a routine
.B yylex().
This file is compiled and linked with the
.B -lfl
library to produce an executable. When the executable is run,
it analyzes its input for occurrences
of the regular expressions. Whenever it finds one, it executes
the corresponding C code.
.SH SOME SIMPLE EXAMPLES
.LP
First some simple examples to get the flavor of how one uses
.I flex.
The following
.I flex
input specifies a scanner which, whenever it encounters the string
"username", will replace it with the user's login name:
.nf

    %%
    username    printf( "%s", getlogin() );

.fi
By default, any text not matched by a
.I flex
scanner
is copied to the output, so the net effect of this scanner is
to copy its input file to its output with each occurrence
of "username" expanded.
In this input, there is just one rule. "username" is the
.I pattern
and the "printf" is the
.I action.
The "%%" marks the beginning of the rules.
.LP
Here's another simple example:
.nf

            int num_lines = 0, num_chars = 0;

    %%
    \\n    ++num_lines; ++num_chars;
    .     ++num_chars;

    %%
    main()
        {
        yylex();
        printf( "# of lines = %d, # of chars = %d\\n",
                num_lines, num_chars );
        }

.fi
This scanner counts the number of characters and the number
of lines in its input (it produces no output other than the
final report on the counts). The first line
declares two globals, "num_lines" and "num_chars", which are accessible
both inside
.B yylex()
and in the
.B main()
routine declared after the second "%%". There are two rules, one
which matches a newline ("\\n") and increments both the line count and
the character count, and one which matches any character other than
a newline (indicated by the "." regular expression).
.LP
A somewhat more complicated example:
.nf

    /* scanner for a toy Pascal-like language */

    %{
    /* need this for the call to atof() below */
    #include <math.h>
    %}

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

    %%

    {DIGIT}+    {
                printf( "An integer: %s (%d)\\n", yytext,
                        atoi( yytext ) );
                }

    {DIGIT}+"."{DIGIT}*        {
                printf( "A float: %s (%g)\\n", yytext,
                        atof( yytext ) );
                }

    if|then|begin|end|procedure|function        {
                printf( "A keyword: %s\\n", yytext );
                }

    {ID}        printf( "An identifier: %s\\n", yytext );

    "+"|"-"|"*"|"/"   printf( "An operator: %s\\n", yytext );

    "{"[^}\\n]*"}"     /* eat up one-line comments */

    [ \\t\\n]+          /* eat up whitespace */

    .           printf( "Unrecognized character: %s\\n", yytext );

    %%

    main( argc, argv )
    int argc;
    char **argv;
        {
        ++argv, --argc;  /* skip over program name */
        if ( argc > 0 )
                yyin = fopen( argv[0], "r" );
        else
                yyin = stdin;

        yylex();
        }

.fi
This is the beginning of a simple scanner for a language like
Pascal. It identifies different types of
.I tokens
and reports on what it has seen.
.LP
The details of this example will be explained in the following
sections.
.SH FORMAT OF THE INPUT FILE
The
.I flex
input file consists of three sections, separated by a line with just
.B %%
in it:
.nf

    definitions
    %%
    rules
    %%
    user code

.fi
The
.I definitions
section contains declarations of simple
.I name
definitions to simplify the scanner specification, and declarations of
.I start conditions,
which are explained in a later section.
.LP
Name definitions have the form:
.nf

    name definition

.fi
The "name" is a word beginning with a letter or an underscore ('_')
followed by zero or more letters, digits, '_', or '-' (dash).
The definition is taken to begin at the first non-white-space character
following the name and continuing to the end of the line.
The definition can subsequently be referred to using "{name}", which
will expand to "(definition)". For example,
.nf

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

.fi
defines "DIGIT" to be a regular expression which matches a
single digit, and
"ID" to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.
A subsequent reference to
.nf

    {DIGIT}+"."{DIGIT}*

.fi
is identical to
.nf

    ([0-9])+"."([0-9])*

.fi
and matches one-or-more digits followed by a '.' followed
by zero-or-more digits.
.LP
The
.I rules
section of the
.I flex
input contains a series of rules of the form:
.nf

    pattern   action

.fi
where the pattern must be unindented and the action must begin
on the same line.
.LP
See below for a further description of patterns and actions.
.LP
Finally, the user code section is simply copied to
.B lex.yy.c
verbatim.
It is used for companion routines which call or are called
by the scanner. The presence of this section is optional;
if it is missing, the second
.B %%
in the input file may be skipped, too.
.LP
In the definitions and rules sections, any
.I indented
text or text enclosed in
.B %{
and
.B %}
is copied verbatim to the output (with the %{}'s removed).
The %{}'s must appear unindented on lines by themselves.
.LP
In the rules section,
any indented or %{} text appearing before the
first rule may be used to declare variables
which are local to the scanning routine and (after the declarations)
code which is to be executed whenever the scanning routine is entered.
Other indented or %{} text in the rule section is still copied to the output,
but its meaning is not well-defined and it may well cause compile-time
errors (this feature is present for
.I POSIX
compliance; see below for other such features).
.LP
In the definitions section, an unindented comment (i.e., a line
beginning with "/*") is also copied verbatim to the output up
to the next "*/". Also, any line in the definitions section
beginning with '#' is ignored, though this style of comment is
deprecated and may go away in the future.
.SH PATTERNS
The patterns in the input are written using an extended set of regular
expressions. These are:
.nf

    x          match the character 'x'
    .          any character except newline
    [xyz]      a "character class"; in this case, the pattern
                 matches either an 'x', a 'y', or a 'z'
    [abj-oZ]   a "character class" with a range in it; matches
                 an 'a', a 'b', any letter from 'j' through 'o',
                 or a 'Z'
    [^A-Z]     a "negated character class", i.e., any character
                 but those in the class. In this case, any
                 character EXCEPT an uppercase letter.
    [^A-Z\\n]   any character EXCEPT an uppercase letter or
                 a newline
    r*         zero or more r's, where r is any regular expression
    r+         one or more r's
    r?         zero or one r's (that is, "an optional r")
    r{2,5}     anywhere from two to five r's
    r{2,}      two or more r's
    r{4}       exactly 4 r's
    {name}     the expansion of the "name" definition
               (see above)
    "[xyz]\\"foo"
               the literal string: [xyz]"foo
    \\X         if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
                 then the ANSI-C interpretation of \\X.
                 Otherwise, a literal 'X' (used to escape
                 operators such as '*')
    \\123       the character with octal value 123
    \\x2a       the character with hexadecimal value 2a
    (r)        match an r; parentheses are used to override
                 precedence (see below)


    rs         the regular expression r followed by the
                 regular expression s; called "concatenation"


    r|s        either an r or an s


    r/s        an r but only if it is followed by an s. The
                 s is not part of the matched text. This type
                 of pattern is called "trailing context".
    ^r         an r, but only at the beginning of a line
    r$         an r, but only at the end of a line. Equivalent
                 to "r/\\n".


    <s>r       an r, but only in start condition s (see
               below for discussion of start conditions)
    <s1,s2,s3>r
               same, but in any of start conditions s1,
               s2, or s3


    <<EOF>>    an end-of-file
    <s1,s2><<EOF>>
               an end-of-file when in start condition s1 or s2

.fi
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence. For example,
.nf

    foo|bar*

.fi
is the same as
.nf

    (foo)|(ba(r*))

.fi
since the '*' operator has higher precedence than concatenation,
and concatenation higher than alternation ('|'). This pattern
therefore matches
.I either
the string "foo"
.I or
the string "ba" followed by zero-or-more r's.
To match "foo" or zero-or-more "bar"'s, use:
.nf

    foo|(bar)*

.fi
and to match zero-or-more "foo"'s-or-"bar"'s:
.nf

    (foo|bar)*

.fi
.LP
Some notes on patterns:
.IP -
A negated character class such as the example "[^A-Z]"
above
.I will match a newline
unless "\\n" (or an equivalent escape sequence) is one of the
characters explicitly present in the negated character class
(e.g., "[^A-Z\\n]"). This is unlike how many other regular
expression tools treat negated character classes, but unfortunately
the inconsistency is historically entrenched.
Matching newlines means that a pattern like [^"]* can match an entire
input (overflowing the scanner's input buffer) unless there's another
quote in the input.
.IP -
A rule can have at most one instance of trailing context (the '/' operator
or the '$' operator). The start condition, '^', and "<<EOF>>" patterns
can only occur at the beginning of a pattern, and, as well as with '/' and '$',
cannot be grouped inside parentheses. A '^' which does not occur at
the beginning of a rule or a '$' which does not occur at the end of
a rule loses its special properties and is treated as a normal character.
.IP
The following are illegal:
.nf

    foo/bar$
    <sc1>foo<sc2>bar

.fi
Note that the first of these can be written "foo/bar\\n".
.IP
The following will result in '$' or '^' being treated as a normal character:
.nf

    foo|(bar$)
    foo|^bar

.fi
If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
could be used (the special '|' action is explained below):
.nf

    foo      |
    bar$     /* action goes here */

.fi
A similar trick will work for matching a foo or a
bar-at-the-beginning-of-a-line.
.SH HOW THE INPUT IS MATCHED
When the generated scanner is run, it analyzes its input looking
for strings which match any of its patterns. If it finds more than
one match, it takes the one matching the most text (for trailing
context rules, this includes the length of the trailing part, even
though it will then be returned to the input). If it finds two
or more matches of the same length, the
rule listed first in the
.I flex
input file is chosen.
.LP
Once the match is determined, the text corresponding to the match
(called the
.I token)
is made available in the global character pointer
.B yytext,
and its length in the global integer
.B yyleng.
The
.I action
corresponding to the matched pattern is then executed (a more
detailed description of actions follows), and then the remaining
input is scanned for another match.
.LP
If no match is found, then the
.I default rule
is executed: the next character in the input is considered matched and
copied to the standard output. Thus, the simplest legal
.I flex
input is:
.nf

    %%

.fi
which generates a scanner that simply copies its input (one character
at a time) to its output.
.SH ACTIONS
Each pattern in a rule has a corresponding action, which can be any
arbitrary C statement. The pattern ends at the first non-escaped
whitespace character; the remainder of the line is its action. If the
action is empty, then when the pattern is matched the input token
is simply discarded. For example, here is the specification for a program
which deletes all occurrences of "zap me" from its input:
.nf

    %%
    "zap me"

.fi
(It will copy all other characters in the input to the output since
they will be matched by the default rule.)
.LP
Here is a program which compresses multiple blanks and tabs down to
a single blank, and throws away whitespace found at the end of a line:
.nf

    %%
    [ \\t]+        putchar( ' ' );
    [ \\t]+$       /* ignore this token */

.fi
.LP
If the action contains a '{', then the action spans till the balancing '}'
is found, and the action may cross multiple lines.
.I flex
knows about C strings and comments and won't be fooled by braces found
within them, but also allows actions to begin with
.B %{
and will consider the action to be all the text up to the next
.B %}
(regardless of ordinary braces inside the action).
.LP
An action consisting solely of a vertical bar ('|') means "same as
the action for the next rule." See below for an illustration.
.LP
Actions can include arbitrary C code, including
.B return
statements to return a value to whatever routine called
.B yylex().
Each time
.B yylex()
is called it continues processing tokens from where it last left
off until it either reaches
the end of the file or executes a return. Once it reaches an end-of-file,
however, then any subsequent call to
.B yylex()
will simply immediately return, unless
.B yyrestart()
is first called (see below).
.LP
Actions are not allowed to modify yytext or yyleng.
.LP
There are a number of special directives which can be included within
an action:
.IP -
.B ECHO
copies yytext to the scanner's output.
.IP -
.B BEGIN
followed by the name of a start condition places the scanner in the
corresponding start condition (see below).
.IP -
.B REJECT
directs the scanner to proceed on to the "second best" rule which matched the
input (or a prefix of the input). The rule is chosen as described
above in "How the Input is Matched", and
.B yytext
and
.B yyleng
set up appropriately.
It may either be one which matched as much text
as the originally chosen rule but came later in the
.I flex
input file, or one which matched less text.
For example, the following will both count the
words in the input and call the routine special() whenever "frob" is seen:
.nf

            int word_count = 0;
    %%

    frob        special(); REJECT;
    [^ \\t\\n]+   ++word_count;

.fi
Without the
.B REJECT,
any "frob"'s in the input would not be counted as words, since the
scanner normally executes only one action per token.
Multiple
.B REJECT's
are allowed, each one finding the next best choice to the currently
active rule. For example, when the following scanner scans the token
"abcd", it will write "abcdabcaba" to the output:
.nf

    %%
    a        |
    ab       |
    abc      |
    abcd     ECHO; REJECT;
    .|\\n     /* eat up any unmatched character */

.fi
(The first three rules share the fourth's action since they use
the special '|' action.)
.B REJECT
is a particularly expensive feature in terms of scanner performance;
if it is used in
.I any
of the scanner's actions it will slow down
.I all
of the scanner's matching. Furthermore,
.B REJECT
cannot be used with the
.I -f
or
.I -F
options (see below).
.IP
Note also that unlike the other special actions,
.B REJECT
is a
.I branch;
code immediately following it in the action will
.I not
be executed.
.IP -
.B yymore()
tells the scanner that the next time it matches a rule, the corresponding
token should be
.I appended
onto the current value of
.B yytext
rather than replacing it. For example, given the input "mega-kludge"
the following will write "mega-mega-kludge" to the output:
.nf

    %%
    mega-    ECHO; yymore();
    kludge   ECHO;

.fi
First "mega-" is matched and echoed to the output. Then "kludge"
is matched, but the previous "mega-" is still hanging around at the
beginning of
.B yytext
so the
.B ECHO
for the "kludge" rule will actually write "mega-kludge".
The presence of
.B yymore()
in the scanner's action entails a minor performance penalty in the
scanner's matching speed.
.IP -
.B yyless(n)
returns all but the first
.I n
characters of the current token back to the input stream, where they
will be rescanned when the scanner looks for the next match.
.B yytext
and
.B yyleng
are adjusted appropriately (e.g.,
.B yyleng
will now be equal to
.I n
). For example, on the input "foobar" the following will write out
"foobarbar":
.nf

    %%
    foobar    ECHO; yyless(3);
    [a-z]+    ECHO;

.fi
An argument of 0 to
.B yyless
will cause the entire current input string to be scanned again. Unless you've
changed how the scanner will subsequently process its input (using
.B BEGIN,
for example), this will result in an endless loop.
.IP -
.B unput(c)
puts the character
.I c
back onto the input stream. It will be the next character scanned.
The following action will take the current token and cause it
to be rescanned enclosed in parentheses.
.nf

    {
    int i;
    unput( ')' );
    for ( i = yyleng - 1; i >= 0; --i )
        unput( yytext[i] );
    unput( '(' );
    }

.fi
Note that since each
.B unput()
puts the given character back at the
.I beginning
of the input stream, pushing back strings must be done back-to-front.
.IP -
.B input()
reads the next character from the input stream. For example,
the following is one way to eat up C comments:
.nf

    %%
    "/*"        {
                register int c;

                for ( ; ; )
                    {
                    while ( (c = input()) != '*' &&
                            c != EOF )
                        ; /* eat up text of comment */

                    if ( c == '*' )
                        {
                        while ( (c = input()) == '*' )
                            ;
                        if ( c == '/' )
                            break; /* found the end */
                        }

                    if ( c == EOF )
                        {
                        error( "EOF in comment" );
                        break;
                        }
                    }
                }

.fi
(Note that if the scanner is compiled using
.B C++,
then
.B input()
is instead referred to as
.B yyinput(),
in order to avoid a name clash with the
.B C++
stream by the name of
.I input.)
.IP -
.B yyterminate()
can be used in lieu of a return statement in an action. It terminates
the scanner and returns a 0 to the scanner's caller, indicating "all done".
Subsequent calls to the scanner will immediately return unless preceded
by a call to
.B yyrestart()
(see below).
By default,
.B yyterminate()
is also called when an end-of-file is encountered. It is a macro and
may be redefined.
.SH THE GENERATED SCANNER
The output of
.I flex
is the file
.B lex.yy.c,
which contains the scanning routine
.B yylex(),
a number of tables used by it for matching tokens, and a number
of auxiliary routines and macros. By default,
.B yylex()
is declared as follows:
.nf

    int yylex()
        {
        ... various definitions and the actions in here ...
        }

.fi
(If your environment supports function prototypes, then it will
be "int yylex( void )".) This definition may be changed by redefining
the "YY_DECL" macro. For example, you could use:
.nf

    #undef YY_DECL
    #define YY_DECL float lexscan( a, b ) float a, b;

.fi
to give the scanning routine the name
.I lexscan,
returning a float, and taking two floats as arguments. Note that
if you give arguments to the scanning routine using a
K&R-style/non-prototyped function declaration, you must terminate
the definition with a semi-colon (;).
.LP
Whenever
.B yylex()
is called, it scans tokens from the global input file
.I yyin
(which defaults to stdin). It continues until it either reaches
an end-of-file (at which point it returns the value 0) or
one of its actions executes a
.I return
statement.
In the former case, when called again the scanner will immediately
return unless
.B yyrestart()
is called to point
.I yyin
at the new input file. (
.B yyrestart()
takes one argument, a
.B FILE *
pointer.)
In the latter case (i.e., when an action
executes a return), the scanner may then be called again and it
will resume scanning where it left off.
.LP
By default (and for purposes of efficiency), the scanner uses
block-reads rather than simple
.I getc()
calls to read characters from
.I yyin.
The nature of how it gets its input can be controlled by redefining the
.B YY_INPUT
macro.
YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
action is to place up to
.I max_size
characters in the character array
.I buf
and return in the integer variable
.I result
either the
number of characters read or the constant YY_NULL (0 on Unix systems)
to indicate EOF. The default YY_INPUT reads from the
global file-pointer "yyin".
.LP
A sample redefinition of YY_INPUT (in the definitions
section of the input file):
.nf

    %{
    #undef YY_INPUT
    #define YY_INPUT(buf,result,max_size) \\
        result = ((buf[0] = getchar()) == EOF) ? YY_NULL : 1;
    %}

.fi
This definition will change the input processing to occur
one character at a time.
.LP
You also can add in things like keeping track of the
input line number this way; but don't expect your scanner to
go very fast.
.LP
When the scanner receives an end-of-file indication from YY_INPUT,
it then checks the
.B yywrap()
function. If
.B yywrap()
returns false (zero), then it is assumed that the
function has gone ahead and set up
.I yyin
to point to another input file, and scanning continues. If it returns
true (non-zero), then the scanner terminates, returning 0 to its
caller.
.LP
The default
.B yywrap()
always returns 1. Presently, to redefine it you must first
"#undef yywrap", as it is currently implemented as a macro. As indicated
by the hedging in the previous sentence, it may be changed to
a true function in the near future.
.LP
The scanner writes its
.B ECHO
output to the
.I yyout
global (default, stdout), which may be redefined by the user simply
by assigning it to some other
.B FILE
pointer.
.SH START CONDITIONS
.I flex
provides a mechanism for conditionally activating rules. Any rule
whose pattern is prefixed with "<sc>" will only be active when
the scanner is in the start condition named "sc". For example,
.nf

    <STRING>[^"]*        { /* eat up the string body ... */
                ...
                }

.fi
will be active only when the scanner is in the "STRING" start
condition, and
.nf

    <INITIAL,STRING,QUOTE>\\.        { /* handle an escape ... */
                ...
                }

.fi
will be active only when the current start condition is
either "INITIAL", "STRING", or "QUOTE".
.LP
Start conditions
are declared in the definitions (first) section of the input
using unindented lines beginning with either
.B %s
or
.B %x
followed by a list of names.
The former declares
.I inclusive
start conditions, the latter
.I exclusive
start conditions. A start condition is activated using the
.B BEGIN
action. Until the next
.B BEGIN
action is executed, rules with the given start
condition will be active and
rules with other start conditions will be inactive.
If the start condition is
.I inclusive,
then rules with no start conditions at all will also be active.
If it is
.I exclusive,
then
.I only
rules qualified with the start condition will be active.
A set of rules contingent on the same exclusive start condition
describes a scanner which is independent of any of the other rules in the
.I flex
input. Because of this,
exclusive start conditions make it easy to specify "mini-scanners"
which scan portions of the input that are syntactically different
from the rest (e.g., comments).
.LP
If the distinction between inclusive and exclusive start conditions
is still a little vague, here's a simple example illustrating the
connection between the two. The set of rules:
.nf

    %s example
    %%
    foo           /* do something */

.fi
is equivalent to
.nf

    %x example
    %%
    <INITIAL,example>foo   /* do something */

.fi
.LP
The default rule (to
.B ECHO
any unmatched character) remains active in start conditions.
.LP
.B BEGIN(0)
returns to the original state where only the rules with
no start conditions are active. This state can also be
referred to as the start-condition "INITIAL", so
.B BEGIN(INITIAL)
is equivalent to
.B BEGIN(0).
(The parentheses around the start condition name are not required but
are considered good style.)
.LP
.B BEGIN
actions can also be given as indented code at the beginning
of the rules section. For example, the following will cause
the scanner to enter the "SPECIAL" start condition whenever
.I yylex()
is called and the global variable
.I enter_special
is true:
.nf

            int enter_special;

    %x SPECIAL
    %%
            if ( enter_special )
                BEGIN(SPECIAL);

    <SPECIAL>blahblahblah
    ...more rules follow...

.fi
.LP
To illustrate the uses of start conditions,
here is a scanner which provides two different interpretations
of a string like "123.456". By default it will treat it
as three tokens, the integer "123", a dot ('.'), and the integer "456".
But if the string is preceded earlier in the line by the string
"expect-floats"
it will treat it as a single token, the floating-point number
123.456:
.nf

    %{
    #include <math.h>
    %}
    %s expect

    %%
    expect-floats        BEGIN(expect);

    <expect>[0-9]+"."[0-9]+      {
                printf( "found a float, = %f\\n",
                        atof( yytext ) );
                }
    <expect>\\n           {
                /* that's the end of the line, so
                 * we need another "expect-floats"
                 * before we'll recognize any more
                 * numbers
                 */
                BEGIN(INITIAL);
                }

    [0-9]+      {
                printf( "found an integer, = %d\\n",
                        atoi( yytext ) );
                }

    "."         printf( "found a dot\\n" );

.fi
Here is a scanner which recognizes (and discards) C comments while
maintaining a count of the current input line.
.nf

    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);

.fi
Note that start condition names are really integer values and
can be stored as such. Thus, the above could be extended in the
following fashion:
.nf

    %x comment foo
    %%
            int line_num = 1;
            int comment_caller;

    "/*"         {
                 comment_caller = INITIAL;
                 BEGIN(comment);
                 }

    ...

    <foo>"/*"    {
                 comment_caller = foo;
                 BEGIN(comment);
                 }

    <comment>[^*\\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\\n             ++line_num;
    <comment>"*"+"/"        BEGIN(comment_caller);

.fi
One can then implement a "stack" of start conditions using an
array of integers. (It is likely that such stacks will become
a full-fledged
.I flex
feature in the future.) Note, though, that
start conditions do not have their own name-space; %s's and %x's
declare names in the same fashion as #define's.
.SH MULTIPLE INPUT BUFFERS
Some scanners (such as those which support "include" files)
require reading from several input streams. As
.I flex
scanners do a large amount of buffering, one cannot control
where the next input will be read from by simply writing a
.B YY_INPUT
which is sensitive to the scanning context.
.B YY_INPUT
is only called when the scanner reaches the end of its buffer, which
may be a long time after scanning a statement such as an "include"
which requires switching the input source.
.LP
To negotiate these sorts of problems,
.I flex
provides a mechanism for creating and switching between multiple
input buffers. An input buffer is created by using:
.nf

    YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )

.fi
which takes a
.I FILE
pointer and a size and creates a buffer associated with the given
file and large enough to hold
.I size
characters (when in doubt, use
.B YY_BUF_SIZE
for the size). It returns a
.B YY_BUFFER_STATE
handle, which may then be passed to other routines:
.nf

    void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )

.fi
switches the scanner's input buffer so subsequent tokens will
come from
.I new_buffer.
Note that
.B yy_switch_to_buffer()
may be used by yywrap() to set things up for continued scanning, instead
of opening a new file and pointing
.I yyin
at it.
.nf

    void yy_delete_buffer( YY_BUFFER_STATE buffer )

.fi
is used to reclaim the storage associated with a buffer.
.LP
.B yy_new_buffer()
is an alias for
.B yy_create_buffer(),
provided for compatibility with the C++ use of
.I new
and
.I delete
for creating and destroying dynamic objects.
.LP
Finally, the
.B YY_CURRENT_BUFFER
macro returns a
.B YY_BUFFER_STATE
handle to the current buffer.
.LP
Here is an example of using these features for writing a scanner
which expands include files (the
.B <<EOF>>
feature is discussed below):
.nf

    /* the "incl" state is used for picking up the name
     * of an include file
     */
    %x incl

    %{
    #define MAX_INCLUDE_DEPTH 10
    YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
    int include_stack_ptr = 0;
    %}

    %%
    include             BEGIN(incl);

    [a-z]+              ECHO;
    [^a-z\\n]*\\n?        ECHO;

    <incl>[ \\t]*      /* eat the whitespace */
    <incl>[^ \\t\\n]+   { /* got the include file name */
            if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                {
                fprintf( stderr, "Includes nested too deeply" );
                exit( 1 );
                }

            include_stack[include_stack_ptr++] =
                YY_CURRENT_BUFFER;

            yyin = fopen( yytext, "r" );

            if ( ! yyin )
                error( ... );

            yy_switch_to_buffer(
                yy_create_buffer( yyin, YY_BUF_SIZE ) );

            BEGIN(INITIAL);
            }

    <<EOF>> {
            if ( --include_stack_ptr < 0 )
                {
                yyterminate();
                }

            else
                yy_switch_to_buffer(
                     include_stack[include_stack_ptr] );
            }

.fi
.SH END-OF-FILE RULES
The special rule "<<EOF>>" indicates
actions which are to be taken when an end-of-file is
encountered and yywrap() returns non-zero (i.e., indicates
no further files to process). The action must finish
by doing one of four things:
.IP -
the special
.B YY_NEW_FILE
action, if
.I yyin
has been pointed at a new file to process;
.IP -
a
.I return
statement;
.IP -
the special
.B yyterminate()
action;
.IP -
or, switching to a new buffer using
.B yy_switch_to_buffer()
as shown in the example above.
.LP
<<EOF>> rules may not be used with other
patterns; they may only be qualified with a list of start
conditions. If an unqualified <<EOF>> rule is given, it
applies to
.I all
start conditions which do not already have <<EOF>> actions. To
specify an <<EOF>> rule for only the initial start condition, use
.nf

    <INITIAL><<EOF>>

.fi
.LP
These rules are useful for catching things like unclosed comments.
An example:
.nf

    %x quote
    %%

    ...other rules for dealing with quotes...

    <quote><<EOF>>   {
             error( "unterminated quote" );
             yyterminate();
             }
    <<EOF>>  {
             if ( *++filelist )
                 {
                 yyin = fopen( *filelist, "r" );
                 YY_NEW_FILE;
                 }
             else
                yyterminate();
             }

.fi
.SH MISCELLANEOUS MACROS
The macro
.B YY_USER_ACTION
can be redefined to provide an action
which is always executed prior to the matched rule's action. For example,
it could be #define'd to call a routine to convert yytext to lower-case.
.LP
The macro
.B YY_USER_INIT
may be redefined to provide an action which is always executed before
the first scan (and before the scanner's internal initializations are done).
For example, it could be used to call a routine to read
in a data table or open a logging file.
.LP
In the generated scanner, the actions are all gathered in one large
switch statement and separated using
.B YY_BREAK,
which may be redefined. By default, it is simply a "break", to separate
each rule's action from the following rule's.
Redefining
.B YY_BREAK
allows, for example, C++ users to
#define YY_BREAK to do nothing (while being very careful that every
rule ends with a "break" or a "return"!) to avoid suffering from
unreachable statement warnings where, because a rule's action ends with
"return", the
.B YY_BREAK
is inaccessible.
.SH INTERFACING WITH YACC
One of the main uses of
.I flex
is as a companion to the
.I yacc
parser-generator.
.I yacc
parsers expect to call a routine named
.B yylex()
to find the next input token. The routine is supposed to
return the type of the next token as well as putting any associated
value in the global
.B yylval.
To use
.I flex
with
.I yacc,
one specifies the
.B -d
option to
.I yacc
to instruct it to generate the file
.B y.tab.h
containing definitions of all the
.B %tokens
appearing in the
.I yacc
input. This file is then included in the
.I flex
scanner. For example, if one of the tokens is "TOK_NUMBER",
part of the scanner might look like:
.nf

    %{
    #include "y.tab.h"
    %}

    %%

    [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;

.fi
.SH TRANSLATION TABLE
In the name of POSIX compliance,
.I flex
supports a
.I translation table
for mapping input characters into groups.
The table is specified in the first section, and its format looks like:
.nf

    %t
    1        abcd
    2        ABCDEFGHIJKLMNOPQRSTUVWXYZ
    52       0123456789
    6        \\t\\ \\n
    %t

.fi
This example specifies that the characters 'a', 'b', 'c', and 'd'
are to all be lumped into group #1, upper-case letters
in group #2, digits in group #52, tabs, blanks, and newlines into
group #6, and
.I
no other characters will appear in the patterns.
The group numbers are actually disregarded by
.I flex;
.B %t
serves, though, to lump characters together. Given the above
table, for example, the pattern "a(AA)*5" is equivalent to "d(ZQ)*0".
They both say, "match any character in group #1, followed by
zero-or-more pairs of characters
from group #2, followed by a character from group #52." Thus
.B %t
provides a crude way for introducing equivalence classes into
the scanner specification.
.LP
Note that the
.B -i
option (see below) coupled with the equivalence classes which
.I flex
automatically generates take care of virtually all the instances
when one might consider using
.B %t.
But what the hell, it's there if you want it.
.SH OPTIONS
.I flex
has the following options:
.TP
.B -b
Generate backtracking information to
.I lex.backtrack.
This is a list of scanner states which require backtracking
and the input characters on which they do so. By adding rules one
can remove backtracking states. If all backtracking states
are eliminated and
.B -f
or
.B -F
is used, the generated scanner will run faster (see the
.B -p
flag). Only users who wish to squeeze every last cycle out of their
scanners need worry about this option. (See the section on PERFORMANCE
CONSIDERATIONS below.)
.TP
.B -c
is a do-nothing, deprecated option included for POSIX compliance.
.IP
.B NOTE:
in previous releases of
.I flex
.B -c
specified table-compression options. This functionality is
now given by the
.B -C
flag. To ease the impact of this change, when
.I flex
encounters
.B -c,
it currently issues a warning message and assumes that
.B -C
was desired instead. In the future this "promotion" of
.B -c
to
.B -C
will go away in the name of full POSIX compliance (unless
the POSIX meaning is removed first).
.TP
.B -d
makes the generated scanner run in
.I debug
mode. Whenever a pattern is recognized and the global
.B yy_flex_debug
is non-zero (which is the default),
the scanner will write to
.I stderr
a line of the form:
.nf

    --accepting rule at line 53 ("the matched text")

.fi
The line number refers to the location of the rule in the file
defining the scanner (i.e., the file that was fed to flex). Messages
are also generated when the scanner backtracks, accepts the
default rule, reaches the end of its input buffer (or encounters
a NUL; at this point, the two look the same as far as the scanner's concerned),
or reaches an end-of-file.
.TP
.B -f
specifies (take your pick)
.I full table
or
.I fast scanner.
No table compression is done. The result is large but fast.
This option is equivalent to
.B -Cf
(see below).
.TP
.B -i
instructs
.I flex
to generate a
.I case-insensitive
scanner. The case of letters given in the
.I flex
input patterns will
be ignored, and tokens in the input will be matched regardless of case. The
matched text given in
.I yytext
will have the preserved case (i.e., it will not be folded).
.TP
.B -n
is another do-nothing, deprecated option included only for
POSIX compliance.
.TP
.B -p
generates a performance report to stderr. The report
consists of comments regarding features of the
.I flex
input file which will cause a loss of performance in the resulting scanner.
Note that the use of
.I REJECT
and variable trailing context (see the BUGS section in flex(1))
entails a substantial performance penalty; use of
.I yymore(),
the
.B ^
operator,
and the
.B -I
flag entail minor performance penalties.
.TP
.B -s
causes the
.I default rule
(that unmatched scanner input is echoed to
.I stdout)
to be suppressed. If the scanner encounters input that does not
match any of its rules, it aborts with an error. This option is
useful for finding holes in a scanner's rule set.
.TP
.B -t
instructs
.I flex
to write the scanner it generates to standard output instead
of
.B lex.yy.c.
.TP
.B -v
specifies that
.I flex
should write to
.I stderr
a summary of statistics regarding the scanner it generates.
Most of the statistics are meaningless to the casual
.I flex
user, but the
first line identifies the version of
.I flex,
which is useful for figuring
out where you stand with respect to patches and new releases,
and the next two lines give the date when the scanner was created
and a summary of the flags which were in effect.
.TP
.B -F
specifies that the
.ul
fast
scanner table representation should be used. This representation is
about as fast as the full table representation
.ul
(-f),
and for some sets of patterns will be considerably smaller (and for
others, larger). In general, if the pattern set contains both "keywords"
and a catch-all, "identifier" rule, such as in the set:
.nf

    "case"    return TOK_CASE;
    "switch"  return TOK_SWITCH;
    ...
    "default" return TOK_DEFAULT;
    [a-z]+    return TOK_ID;

.fi
then you're better off using the full table representation. If only
the "identifier" rule is present and you then use a hash table or some such
to detect the keywords, you're better off using
.ul
-F.
.IP
This option is equivalent to
.B -CF
(see below).
.TP
.B -I
instructs
.I flex
to generate an
.I interactive
scanner. Normally, scanners generated by
.I flex
always look ahead one
character before deciding that a rule has been matched. At the cost of
some scanning overhead,
.I flex
will generate a scanner which only looks ahead
when needed. Such scanners are called
.I interactive
because if you want to write a scanner for an interactive system such as a
command shell, you will probably want the user's input to be terminated
with a newline, and without
.B -I
the user will have to type a character in addition to the newline in order
to have the newline recognized. This leads to dreadful interactive
performance.
.IP
If all this seems too confusing, here's the general rule: if a human will
be typing in input to your scanner, use
.B -I,
otherwise don't; if you don't care about squeezing the utmost performance
from your scanner and you
don't want to make any assumptions about the input to your scanner,
use
.B -I.
.IP
Note,
.B -I
cannot be used in conjunction with
.I full
or
.I fast tables,
i.e., the
.B -f, -F, -Cf,
or
.B -CF
flags.
.TP
.B -L
instructs
.I flex
not to generate
.B #line
directives. Without this option,
.I flex
peppers the generated scanner
with #line directives so error messages in the actions will be correctly
located with respect to the original
.I flex
input file, and not to
the fairly meaningless line numbers of
.B lex.yy.c.
(Unfortunately
.I flex
does not presently generate the necessary directives
to "retarget" the line numbers for those parts of
.B lex.yy.c
which it generated. So if there is an error in the generated code,
a meaningless line number is reported.)
.TP
.B -T
makes
.I flex
run in
.I trace
mode. It will generate a lot of messages to
.I stdout
concerning
the form of the input and the resultant non-deterministic and deterministic
finite automata. This option is mostly for use in maintaining
.I flex.
.TP
.B -8
instructs
.I flex
to generate an 8-bit scanner, i.e., one which can recognize 8-bit
characters. On some sites,
.I flex
is installed with this option as the default. On others, the default
is 7-bit characters. To see which is the case, check the verbose
.B (-v)
output for "equivalence classes created". If the denominator of
the number shown is 128, then by default
.I flex
is generating 7-bit characters. If it is 256, then the default is
8-bit characters and the
.B -8
flag is not required (but may be a good idea to keep the scanner
specification portable). Feeding a 7-bit scanner 8-bit characters
will result in infinite loops, bus errors, or other such fireworks,
so when in doubt, use the flag. Note that if equivalence classes
are used, 8-bit scanners take only slightly more table space than
7-bit scanners (128 bytes, to be exact); if equivalence classes are
not used, however, then the tables may grow up to twice their
7-bit size.
.TP
.B -C[efmF]
controls the degree of table compression.
.IP
.B -Ce
directs
.I flex
to construct
.I equivalence classes,
i.e., sets of characters
which have identical lexical properties (for example, if the only
appearance of digits in the
.I flex
input is in the character class
"[0-9]" then the digits '0', '1', ..., '9' will all be put
in the same equivalence class). Equivalence classes usually give
dramatic reductions in the final table/object file sizes (typically
a factor of 2-5) and are pretty cheap performance-wise (one array
look-up per character scanned).
.IP
.B -Cf
specifies that the
.I full
scanner tables should be generated -
.I flex
should not compress the
tables by taking advantage of similar transition functions for
different states.
.IP
.B -CF
specifies that the alternate fast scanner representation (described
above under the
.B -F
flag)
should be used.
.IP
.B -Cm
directs
.I flex
to construct
.I meta-equivalence classes,
which are sets of equivalence classes (or characters, if equivalence
classes are not being used) that are commonly used together. Meta-equivalence
classes are often a big win when using compressed tables, but they
have a moderate performance impact (one or two "if" tests and one
array look-up per character scanned).
.IP
A lone
.B -C
specifies that the scanner tables should be compressed but neither
equivalence classes nor meta-equivalence classes should be used.
.IP
The options
.B -Cf
or
.B -CF
and
.B -Cm
do not make sense together - there is no opportunity for meta-equivalence
classes if the table is not being compressed. Otherwise the options
may be freely mixed.
.IP
The default setting is
.B -Cem,
which specifies that
.I flex
should generate equivalence classes
and meta-equivalence classes. This setting provides the highest
degree of table compression. You can trade off
faster-executing scanners at the cost of larger tables with
the following generally being true:
.nf

    slowest & smallest
          -Cem
          -Cm
          -Ce
          -C
          -C{f,F}e
          -C{f,F}
    fastest & largest

.fi
Note that scanners with the smallest tables are usually generated and
compiled the quickest, so
during development you will usually want to use the default, maximal
compression.
.IP
.B -Cfe
is often a good compromise between speed and size for production
scanners.
.IP
.B -C
options are not cumulative; whenever the flag is encountered, the
previous -C settings are forgotten.
.TP
.B -Sskeleton_file
overrides the default skeleton file from which
.I flex
constructs its scanners. You'll never need this option unless you are doing
.I flex
maintenance or development.
.SH PERFORMANCE CONSIDERATIONS
The main design goal of
.I flex
is that it generate high-performance scanners. It has been optimized
for dealing well with large sets of rules. Aside from the effects
of table compression on scanner speed outlined above,
there are a number of options/actions which degrade performance. These
are, from most expensive to least:
.nf

    REJECT

    pattern sets that require backtracking
    arbitrary trailing context

    '^' beginning-of-line operator
    yymore()

.fi
with the first three all being quite expensive and the last two
being quite cheap.
.LP
.B REJECT
should be avoided at all costs when performance is important.
It is a particularly expensive option.
.LP
Getting rid of backtracking is messy and often may be an enormous
amount of work for a complicated scanner. In principle, one begins
by using the
.B -b
flag to generate a
.I lex.backtrack
file. For example, on the input
.nf

    %%
    foo        return TOK_KEYWORD;
    foobar     return TOK_KEYWORD;

.fi
the file looks like:
.nf

    State #6 is non-accepting -
     associated rule line numbers:
           2       3
     out-transitions: [ o ]
     jam-transitions: EOF [ \\001-n  p-\\177 ]

    State #8 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ a ]
     jam-transitions: EOF [ \\001-`  b-\\177 ]

    State #9 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ r ]
     jam-transitions: EOF [ \\001-q  s-\\177 ]

    Compressed tables always backtrack.

.fi
The first few lines tell us that there's a scanner state in
which it can make a transition on an 'o' but not on any other
character, and that in that state the currently scanned text does not match
any rule. The state occurs when trying to match the rules found
at lines 2 and 3 in the input file.
If the scanner is in that state and then reads
something other than an 'o', it will have to backtrack to find
a rule which is matched. With
a bit of headscratching one can see that this must be the
state it's in when it has seen "fo". When this has happened,
if anything other than another 'o' is seen, the scanner will
have to back up to simply match the 'f' (by the default rule).
.LP
The comment regarding State #8 indicates there's a problem
when "foob" has been scanned. Indeed, on any character other
than an 'a', the scanner will have to back up to accept "foo".
Similarly, the comment for State #9 concerns when "fooba" has
been scanned.
.LP
The final comment reminds us that there's no point going to
all the trouble of removing backtracking from the rules unless
we're using
.B -f
or
.B -F,
since there's no performance gain doing so with compressed scanners.
.LP
The way to remove the backtracking is to add "error" rules:
.nf

    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;

    fooba       |
    foob        |
    fo          {
                /* false alarm, not really a keyword */
                return TOK_ID;
                }

.fi
.LP
Eliminating backtracking among a list of keywords can also be
done using a "catch-all" rule:
.nf

    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;

    [a-z]+      return TOK_ID;

.fi
This is usually the best solution when appropriate.
.LP
Backtracking messages tend to cascade.
With a complicated set of rules it's not uncommon to get hundreds
of messages. If one can decipher them, though, it often
only takes a dozen or so rules to eliminate the backtracking (though
it's easy to make a mistake and have an error rule accidentally match
a valid token. A possible future
.I flex
feature will be to automatically add rules to eliminate backtracking).
.LP
.I Variable
trailing context (where both the leading and trailing parts do not have
a fixed length) entails almost the same performance loss as
.I REJECT
(i.e., substantial). So when possible a rule like:
.nf

    %%
    mouse|rat/(cat|dog)   run();

.fi
is better written:
.nf

    %%
    mouse/cat|dog         run();
    rat/cat|dog           run();

.fi
or as
.nf

    %%
    mouse|rat/cat         run();
    mouse|rat/dog         run();

.fi
Note that here the special '|' action does
.I not
provide any savings, and can even make things worse (see
.B BUGS
in flex(1)).
.LP
Another area where the user can increase a scanner's performance
(and one that's easier to implement) arises from the fact that
the longer the tokens matched, the faster the scanner will run.
This is because with long tokens the processing of most input
characters takes place in the (short) inner scanning loop, and
does not often have to go through the additional work of setting up
the scanning environment (e.g.,
.B yytext)
for the action. Recall the scanner for C comments:
.nf

    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\\n]*
    <comment>"*"+[^*/\\n]*
    <comment>\\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);

.fi
This could be sped up by writing it as:
.nf

    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\\n]*
    <comment>[^*\\n]*\\n      ++line_num;
    <comment>"*"+[^*/\\n]*
    <comment>"*"+[^*/\\n]*\\n ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);

.fi
Now instead of each newline requiring the processing of another
action, recognizing the newlines is "distributed" over the other rules
to keep the matched text as long as possible. Note that
.I adding
rules does
.I not
slow down the scanner! The speed of the scanner is independent
of the number of rules or (modulo the considerations given at the
beginning of this section) how complicated the rules are with
regard to operators such as '*' and '|'.
.LP
A final example in speeding up a scanner: suppose you want to scan
through a file containing identifiers and keywords, one per line
and with no other extraneous characters, and recognize all the
keywords. A natural first approach is:
.nf

    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    .|\\n     /* it's not a keyword */

.fi
To eliminate the back-tracking, introduce a catch-all rule:
.nf

    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    [a-z]+   |
    .|\\n     /* it's not a keyword */

.fi
Now, if it's guaranteed that there's exactly one word per line,
then we can reduce the total number of matches by a half by
merging in the recognition of newlines with that of the other
tokens:
.nf

    %%
    asm\\n    |
    auto\\n   |
    break\\n  |
    ... etc ...
    volatile\\n |
    while\\n  /* it's a keyword */

    [a-z]+\\n |
    .|\\n     /* it's not a keyword */

.fi
One has to be careful here, as we have now reintroduced backtracking
into the scanner. In particular, while
.I we
know that there will never be any characters in the input stream
other than letters or newlines,
.I flex
can't figure this out, and it will plan for possibly needing backtracking
when it has scanned a token like "auto" and then the next character
is something other than a newline or a letter. Previously it would
then just match the "auto" rule and be done, but now it has no "auto"
rule, only an "auto\\n" rule. To eliminate the possibility of backtracking,
we could either duplicate all rules but without final newlines, or,
since we never expect to encounter such an input and therefore don't
care how it's classified, we can introduce one more catch-all rule, this
one which doesn't include a newline:
.nf

    %%
    asm\\n    |
    auto\\n   |
    break\\n  |
    ... etc ...
    volatile\\n |
    while\\n  /* it's a keyword */

    [a-z]+\\n |
    [a-z]+   |
    .|\\n     /* it's not a keyword */

.fi
Compiled with
.B -Cf,
this is about as fast as one can get a
.I flex
scanner to go for this particular problem.
.LP
A final note:
.I flex
is slow when matching NUL's, particularly when a token contains
multiple NUL's.
It's best to write rules which match
.I short
amounts of text if it's anticipated that the text will often include NUL's.
.SH INCOMPATIBILITIES WITH LEX AND POSIX
.I flex
is a rewrite of the Unix
.I lex
tool (the two implementations do not share any code, though),
with some extensions and incompatibilities, both of which
are of concern to those who wish to write scanners acceptable
to either implementation. At present, the POSIX
.I lex
draft is
very close to the original
.I lex
implementation, so some of these
incompatibilities are also in conflict with the POSIX draft. But
the intent is that except as noted below,
.I flex
as it presently stands will
ultimately be POSIX conformant (i.e., that those areas of conflict with
the POSIX draft will be resolved in
.I flex's
favor). Please bear in
mind that all the comments which follow are with regard to the POSIX
.I draft
standard of Summer 1989, and not the final document (or subsequent
drafts); they are included so
.I flex
users can be aware of the standardization issues and those areas where
.I flex
may in the near future undergo changes incompatible with
its current definition.
.LP
.I flex
is fully compatible with
.I lex
with the following exceptions:
.IP -
.I lex
does not support exclusive start conditions (%x), though they
are in the current POSIX draft.
.IP -
When definitions are expanded,
.I flex
encloses them in parentheses.
With lex, the following:
.nf

    NAME    [A-Z][A-Z0-9]*
    %%
    foo{NAME}?      printf( "Found it\\n" );
    %%

.fi
will not match the string "foo" because when the macro
is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
and the precedence is such that the '?' is associated with
"[A-Z0-9]*". With
.I flex,
the rule will be expanded to
"foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
Note that because of this, the
.B ^, $, <s>, /,
and
.B <<EOF>>
operators cannot be used in a
.I flex
definition.
.IP
The POSIX draft interpretation is the same as
.I flex's.
.IP -
To specify a character class which matches anything but a right bracket (']'),
in
.I lex
one can use "[^]]" but with
.I flex
one must use "[^\\]]". The latter works with
.I lex,
too.
.IP -
The undocumented
.I lex
scanner internal variable
.B yylineno
is not supported. (The variable is not part of the POSIX draft.)
.IP -
The
.B input()
routine is not redefinable, though it may be called to read characters
following whatever has been matched by a rule. If
.B input()
encounters an end-of-file the normal
.B yywrap()
processing is done. A ``real'' end-of-file is returned by
2101: .B input() ! 2102: as ! 2103: .I EOF. ! 2104: .IP ! 2105: Input is instead controlled by redefining the ! 2106: .B YY_INPUT ! 2107: macro. ! 2108: .IP ! 2109: The ! 2110: .I flex ! 2111: restriction that ! 2112: .B input() ! 2113: cannot be redefined is in accordance with the POSIX draft, but ! 2114: .B YY_INPUT ! 2115: has not yet been accepted into the draft. ! 2116: .IP - ! 2117: .B output() ! 2118: is not supported. ! 2119: Output from the ! 2120: .B ECHO ! 2121: macro is done to the file-pointer ! 2122: .I yyout ! 2123: (default ! 2124: .I stdout). ! 2125: .IP ! 2126: The POSIX draft mentions that an ! 2127: .B output() ! 2128: routine exists but currently gives no details as to what it does. ! 2129: .IP - ! 2130: The ! 2131: .I lex ! 2132: .B %r ! 2133: (generate a Ratfor scanner) option is not supported. It is not part ! 2134: of the POSIX draft. ! 2135: .IP - ! 2136: If you are providing your own yywrap() routine, you must include a ! 2137: "#undef yywrap" in the definitions section (section 1). Note that ! 2138: the "#undef" will have to be enclosed in %{}'s. ! 2139: .IP ! 2140: The POSIX draft ! 2141: specifies that yywrap() is a function and this is unlikely to change; so ! 2142: .I flex users are warned ! 2143: that ! 2144: .B yywrap() ! 2145: is likely to be changed to a function in the near future. ! 2146: .IP - ! 2147: After a call to ! 2148: .B unput(), ! 2149: .I yytext ! 2150: and ! 2151: .I yyleng ! 2152: are undefined until the next token is matched. This is not the case with ! 2153: .I lex ! 2154: or the present POSIX draft. ! 2155: .IP - ! 2156: The precedence of the ! 2157: .B {} ! 2158: (numeric range) operator is different. ! 2159: .I lex ! 2160: interprets "abc{1,3}" as "match one, two, or ! 2161: three occurrences of 'abc'", whereas ! 2162: .I flex ! 2163: interprets it as "match 'ab' ! 2164: followed by one, two, or three occurrences of 'c'". The latter is ! 2165: in agreement with the current POSIX draft. ! 2166: .IP - ! 
2167: The precedence of the ! 2168: .B ^ ! 2169: operator is different. ! 2170: .I lex ! 2171: interprets "^foo|bar" as "match either 'foo' at the beginning of a line, ! 2172: or 'bar' anywhere", whereas ! 2173: .I flex ! 2174: interprets it as "match either 'foo' or 'bar' if they come at the beginning ! 2175: of a line". The latter is in agreement with the current POSIX draft. ! 2176: .IP - ! 2177: To refer to yytext outside of the scanner source file, ! 2178: the correct definition with ! 2179: .I flex ! 2180: is "extern char *yytext" rather than "extern char yytext[]". ! 2181: This is contrary to the current POSIX draft but a point on which ! 2182: .I flex ! 2183: will not be changing, as the array representation entails a ! 2184: serious performance penalty. It is hoped that the POSIX draft will ! 2185: be emended to support the ! 2186: .I flex ! 2187: variety of declaration (as this is a fairly painless change to ! 2188: require of ! 2189: .I lex ! 2190: users). ! 2191: .IP - ! 2192: .I yyin ! 2193: is ! 2194: .I initialized ! 2195: by ! 2196: .I lex ! 2197: to be ! 2198: .I stdin; ! 2199: .I flex, ! 2200: on the other hand, ! 2201: initializes ! 2202: .I yyin ! 2203: to NULL ! 2204: and then ! 2205: .I assigns ! 2206: it to ! 2207: .I stdin ! 2208: the first time the scanner is called, providing ! 2209: .I yyin ! 2210: has not already been assigned to a non-NULL value. The difference is ! 2211: subtle, but the net effect is that with ! 2212: .I flex ! 2213: scanners, ! 2214: .I yyin ! 2215: does not have a valid value until the scanner has been called. ! 2216: .IP - ! 2217: The special table-size declarations such as ! 2218: .B %a ! 2219: supported by ! 2220: .I lex ! 2221: are not required by ! 2222: .I flex ! 2223: scanners; ! 2224: .I flex ! 2225: ignores them. ! 2226: .IP - ! 2227: The name ! 2228: .bd ! 2229: FLEX_SCANNER ! 2230: is #define'd so scanners may be written for use with either ! 2231: .I flex ! 2232: or ! 2233: .I lex. ! 2234: .LP ! 
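The definition-expansion difference in the list above (flex wraps an expanded definition in parentheses, lex pastes it in bare) can be reproduced with a few lines of plain C. This is only an illustrative sketch: expand_definition() is a hypothetical helper invented for this example, it is not part of flex or lex, and it does not reflect how either tool is actually implemented.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of flex or lex): copy `pattern` into `out`,
 * replacing every "{name}" reference with `body`.  When `parenthesize` is
 * nonzero the body is wrapped in parentheses, which is the flex behaviour
 * described in the text; when it is zero the body is inserted bare, which
 * is the lex behaviour described in the text.
 * Returns 0 on success, -1 if `out` (of size `outsz`) is too small.
 */
int expand_definition(const char *pattern, const char *name, const char *body,
                      int parenthesize, char *out, size_t outsz)
{
    size_t namelen = strlen(name);
    size_t used = 0;

    assert(outsz > 0);
    out[0] = '\0';

    while (*pattern != '\0') {
        /* Does a "{name}" reference start at this position? */
        if (pattern[0] == '{' &&
            strncmp(pattern + 1, name, namelen) == 0 &&
            pattern[1 + namelen] == '}') {
            int n = snprintf(out + used, outsz - used,
                             parenthesize ? "(%s)" : "%s", body);
            if (n < 0 || (size_t)n >= outsz - used)
                return -1;              /* expansion would not fit */
            used += (size_t)n;
            pattern += namelen + 2;     /* skip over "{name}" */
        } else {
            if (used + 1 >= outsz)
                return -1;              /* no room for char plus NUL */
            out[used++] = *pattern++;
            out[used] = '\0';
        }
    }
    return 0;
}
```

Called with the pattern "foo{NAME}?" and the definition body "[A-Z][A-Z0-9]*", the helper yields the lex-style string when parenthesize is 0 and the flex-style string when it is 1. These are the two expansions quoted in the list item, and the only difference between them is the pair of parentheses that lets the '?' apply to the whole definition.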
2235: The following ! 2236: .I flex ! 2237: features are not included in ! 2238: .I lex ! 2239: or the POSIX draft standard: ! 2240: .nf ! 2241: ! 2242: yyterminate() ! 2243: <<EOF>> ! 2244: YY_DECL ! 2245: #line directives ! 2246: %{}'s around actions ! 2247: yyrestart() ! 2248: comments beginning with '#' (deprecated) ! 2249: multiple actions on a line ! 2250: ! 2251: .fi ! 2252: This last feature refers to the fact that with ! 2253: .I flex ! 2254: you can put multiple actions on the same line, separated with ! 2255: semi-colons, while with ! 2256: .I lex, ! 2257: the following ! 2258: .nf ! 2259: ! 2260: foo handle_foo(); ++num_foos_seen; ! 2261: ! 2262: .fi ! 2263: is (rather surprisingly) truncated to ! 2264: .nf ! 2265: ! 2266: foo handle_foo(); ! 2267: ! 2268: .fi ! 2269: .I flex ! 2270: does not truncate the action. Actions that are not enclosed in ! 2271: braces are simply terminated at the end of the line. ! 2272: .SH DIAGNOSTICS ! 2273: .I reject_used_but_not_detected undefined ! 2274: or ! 2275: .I yymore_used_but_not_detected undefined - ! 2276: These errors can occur at compile time. They indicate that the ! 2277: scanner uses ! 2278: .B REJECT ! 2279: or ! 2280: .B yymore() ! 2281: but that ! 2282: .I flex ! 2283: failed to notice the fact, meaning that ! 2284: .I flex ! 2285: scanned the first two sections looking for occurrences of these actions ! 2286: and failed to find any, but somehow you snuck some in (via a #include ! 2287: file, for example). Make an explicit reference to the action in your ! 2288: .I flex ! 2289: input file. (Note that previously ! 2290: .I flex ! 2291: supported a ! 2292: .B %used/%unused ! 2293: mechanism for dealing with this problem; this feature is still supported ! 2294: but now deprecated, and will go away soon unless the author hears from ! 2295: people who can argue compellingly that they need it.) ! 2296: .LP ! 2297: .I flex scanner jammed - ! 2298: a scanner compiled with ! 2299: .B -s ! 
2300: has encountered an input string which wasn't matched by ! 2301: any of its rules. ! 2302: .LP ! 2303: .I flex input buffer overflowed - ! 2304: a scanner rule matched a string long enough to overflow the ! 2305: scanner's internal input buffer (16K bytes by default - controlled by ! 2306: .B YY_BUF_SIZE ! 2307: in "flex.skel". Note that to redefine this macro, you must first ! 2308: .B #undef ! 2309: it). ! 2310: .LP ! 2311: .I scanner requires -8 flag - ! 2312: Your scanner specification includes recognizing 8-bit characters and ! 2313: you did not specify the -8 flag (and your site has not installed flex ! 2314: with -8 as the default). ! 2315: .LP ! 2316: .I too many %t classes! - ! 2317: You managed to put every single character into its own %t class. ! 2318: .I flex ! 2319: requires that at least one of the classes share characters. ! 2320: .SH DEFICIENCIES / BUGS ! 2321: See flex(1). ! 2322: .SH "SEE ALSO" ! 2323: .LP ! 2324: flex(1), lex(1), yacc(1), sed(1), awk(1). ! 2325: .LP ! 2326: M. E. Lesk and E. Schmidt, ! 2327: .I LEX - Lexical Analyzer Generator ! 2328: .SH AUTHOR ! 2329: Vern Paxson, with the help of many ideas and much inspiration from ! 2330: Van Jacobson. Original version by Jef Poskanzer. The fast table ! 2331: representation is a partial implementation of a design done by Van ! 2332: Jacobson. The implementation was done by Kevin Gong and Vern Paxson. ! 2333: .LP ! 2334: Thanks to the many ! 2335: .I flex ! 2336: beta-testers, feedbackers, and contributors, especially Casey ! 2337: Leedom, [email protected], ! 2338: Frederic Brehm, Nick Christopher, Jason Coughlin, ! 2339: Scott David Daniels, Leo Eskin, ! 2340: Chris Faylor, Eric Goldman, Eric ! 2341: Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht, ! 2342: Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt, ! 2343: Jef Poskanzer, Jim Roskind, ! 2344: Dave Tallman, Frank Whaley, Ken Yap, and those whose names !
2345: have slipped my marginal mail-archiving skills but whose contributions ! 2346: are appreciated all the same. ! 2347: .LP ! 2348: Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob ! 2349: Mulcahy, Rich Salz, and Richard Stallman for help with various distribution ! 2350: headaches. ! 2351: .LP ! 2352: Thanks to Esmond Pitt and Earle Horton for 8-bit character support; ! 2353: to Benson Margulies and Fred ! 2354: Burke for C++ support; to Ove Ewerlid for the basics of support for ! 2355: NUL's; and to Eric Hughes for the basics of support for multiple buffers. ! 2356: .LP ! 2357: Work is being done on extending ! 2358: .I flex ! 2359: to generate scanners in which the ! 2360: state machine is directly represented in C code rather than tables. ! 2361: These scanners may well be substantially faster than those generated ! 2362: using -f or -F. If you are working in this area and are interested ! 2363: in comparing notes and seeing whether redundant work can be avoided, ! 2364: contact Ove Ewerlid ([email protected]). ! 2365: .LP ! 2366: This work was primarily done when I was at the Real Time Systems Group ! 2367: at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there ! 2368: for the support I received. ! 2369: .LP ! 2370: Send comments to: ! 2371: .nf ! 2372: ! 2373: Vern Paxson ! 2374: Computer Science Department ! 2375: 4126 Upson Hall ! 2376: Cornell University ! 2377: Ithaca, NY 14853-7501 ! 2378: ! 2379: [email protected] ! 2380: decvax!cornell!vern ! 2381: ! 2382: .fi