|
|
1.1 ! root 1: .\" @(#)lex.ms 6.2 (Berkeley) 5/1/87 ! 2: .\" ! 3: .EH 'PS1:16-%''Lex \- A Lexical Analyzer Generator' ! 4: .OH 'Lex \- A Lexical Analyzer Generator''PS1:16-%' ! 5: .hc ~ ! 6: .bd I 2 ! 7: .de TS ! 8: .br ! 9: .nf ! 10: .SP 1v ! 11: .ul 0 ! 12: .. ! 13: .de TE ! 14: .SP 1v ! 15: .fi ! 16: .. ! 17: .\".de PT ! 18: .\".if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0'' ! 19: .\".if \\n%>1 'sp ! 20: .\".. ! 21: .ND July 21, 1975 ! 22: .\".RP ! 23: .\".TM 75-1274-15 39199 39199-11 ! 24: .TL ! 25: Lex \- A Lexical Analyzer ~Generator~ ! 26: .AU ``MH 2C-569'' 6377 ! 27: M. E. Lesk and E. Schmidt ! 28: .AI ! 29: .MH ! 30: .AB ! 31: .sp ! 32: .bd I 2 ! 33: .\".nr PS 8 ! 34: .\".nr VS 9 ! 35: .\".ps 8 ! 36: .\".vs 9p ! 37: Lex helps write programs whose control flow ! 38: is directed by instances of regular ! 39: expressions in the input stream. ! 40: It is well suited for editor-script type transformations and ! 41: for segmenting input in preparation for ! 42: a parsing routine. ! 43: .PP ! 44: Lex source is a table of regular expressions and corresponding program fragments. ! 45: The table is translated to a program ! 46: which reads an input stream, copying it to an output stream ! 47: and partitioning the input ! 48: into strings which match the given expressions. ! 49: As each such string is recognized the corresponding ! 50: program fragment is executed. ! 51: The recognition of the expressions ! 52: is performed by a deterministic finite automaton ! 53: generated by Lex. ! 54: The program fragments written by the user are executed in the order in which the ! 55: corresponding regular expressions occur in the input stream. ! 56: .if n .if \n(tm .ig ! 57: .PP ! 58: The lexical analysis ! 59: programs written with Lex accept ambiguous specifications ! 60: and choose the longest ! 61: match possible at each input point. ! 62: If necessary, substantial look~ahead ! 63: is performed on the input, but the ! 64: input stream will be backed up to the ! 65: end of the current partition, so that the user ! 66: has general freedom to manipulate it. ! 67: .PP ! 68: Lex can generate analyzers in either C or Ratfor, a language ! 69: which can be translated automatically to portable Fortran. ! 70: It is available on the PDP-11 UNIX, Honeywell GCOS, ! 71: and IBM OS systems. ! 72: This manual, however, will only discuss generating analyzers ! 73: in C on the UNIX system, which is the only supported ! 74: form of Lex under UNIX Version 7. ! 75: Lex is designed to simplify ! 76: interfacing with Yacc, for those ! 77: with access to this compiler-compiler system. ! 78: .. ! 79: .\".nr PS 9 ! 80: .\".nr VS 11 ! 81: .AE ! 82: .2C ! 83: .NH ! 84: Introduction. ! 85: .PP ! 86: Lex is a program generator designed for ! 87: lexical processing of character input streams. ! 88: It accepts a high-level, problem oriented specification ! 89: for character string matching, ! 90: and ! 91: produces a program in a general purpose language which recognizes ! 92: regular expressions. ! 93: The regular expressions are specified by the user in the ! 94: source specifications given to Lex. ! 95: The Lex written code recognizes these expressions ! 96: in an input stream and partitions the input stream into ! 97: strings matching the expressions. At the bound~aries ! 98: between strings ! 99: program sections ! 100: provided by the user are executed. ! 101: The Lex source file associates the regular expressions and the ! 102: program fragments. ! 103: As each expression appears in the input to the program written by Lex, ! 104: the corresponding fragment is executed. ! 105: .PP ! 106: .de MH ! 107: Bell Laboratories, Murray Hill, NJ 07974. ! 108: .. ! 109: The user supplies the additional code ! 110: beyond expression matching ! 111: needed to complete his tasks, possibly ! 112: including code written by other generators. ! 113: The program that recognizes the expressions is generated in the ! 114: general purpose programming language employed for the ! 115: user's program fragments. ! 116: Thus, a high level expression ! 117: language is provided to write the string expressions to be ! 118: matched while the user's freedom to write actions ! 119: is unimpaired. ! 120: This avoids forcing the user who wishes to use a string manipulation ! 121: language for input analysis to write processing programs in the same ! 122: and often inappropriate string handling language. ! 123: .PP ! 124: Lex is not a complete language, but rather a generator representing ! 125: a new language feature which can be added to ! 126: different programming languages, called ``host languages.'' ! 127: Just as general purpose languages ! 128: can produce code to run on different computer hardware, ! 129: Lex can write code in different host languages. ! 130: The host language is used for the output code generated by Lex ! 131: and also for the program fragments added by the user. ! 132: Compatible run-time libraries for the different host languages ! 133: are also provided. ! 134: This makes Lex adaptable to different environments and ! 135: different users. ! 136: Each application ! 137: may be directed to the combination of hardware and host language appropriate ! 138: to the task, the user's background, and the properties of local ! 139: implementations. ! 140: At present, the only supported host language is C, ! 141: although Fortran (in the form of Ratfor [2] has been available ! 142: in the past. ! 143: Lex itself exists on UNIX, GCOS, and OS/370; but the ! 144: code generated by Lex may be taken anywhere the appropriate ! 145: compilers exist. ! 146: .PP ! 147: Lex turns the user's expressions and actions ! 148: (called ! 149: .ul ! 150: source ! 151: in this memo) into the host general-purpose language; ! 152: the generated program is named ! 153: .ul ! 154: yylex. ! 155: The ! 156: .ul ! 157: yylex ! 158: program ! 159: will recognize expressions ! 160: in a stream ! 161: (called ! 162: .ul ! 163: input ! 164: in this memo) ! 165: and perform the specified actions for each expression as it is detected. ! 166: See Figure 1. ! 167: .GS ! 168: .TS ! 169: center; ! 170: l _ r ! 171: l|c|r ! 172: l _ r ! 173: l _ r ! 174: l|c|r ! 175: l _ r ! 176: c s s ! 177: c s s. ! 178: ! 179: Source \(-> Lex \(-> yylex ! 180: ! 181: .sp 2 ! 182: ! 183: Input \(-> yylex \(-> Output ! 184: ! 185: .sp ! 186: An overview of Lex ! 187: Figure 1 ! 188: .TE ! 189: .GE ! 190: .PP ! 191: For a trivial example, consider a program to delete ! 192: from the input ! 193: all blanks or tabs at the ends of lines. ! 194: .TS ! 195: center; ! 196: l l. ! 197: %% ! 198: [ \et]+$ ; ! 199: .TE ! 200: is all that is required. ! 201: The program ! 202: contains a %% delimiter to mark the beginning of the rules, and ! 203: one rule. ! 204: This rule contains a regular expression ! 205: which matches one or more ! 206: instances of the characters blank or tab ! 207: (written \et for visibility, in accordance with the C language convention) ! 208: just prior to the end of a line. ! 209: The brackets indicate the character ! 210: class made of blank and tab; the + indicates ``one or more ...''; ! 211: and the $ indicates ``end of line,'' as in QED. ! 212: No action is specified, ! 213: so the program generated by Lex (yylex) will ignore these characters. ! 214: Everything else will be copied. ! 215: To change any remaining ! 216: string of blanks or tabs to a single blank, ! 217: add another rule: ! 218: .TS ! 219: center; ! 220: l l. ! 221: %% ! 222: [ \et]+$ ; ! 223: [ \et]+ printf(" "); ! 224: .TE ! 225: The finite automaton generated for this ! 226: source will scan for both rules at once, ! 227: observing at ! 228: the termination of the string of blanks or tabs ! 229: whether or not there is a newline character, and executing ! 230: the desired rule action. ! 231: The first rule matches all strings of blanks or tabs ! 232: at the end of lines, and the second ! 233: rule all remaining strings of blanks or tabs. ! 234: .PP ! 235: Lex can be used alone for simple transformations, or ! 236: for analysis and statistics gathering on a lexical level. ! 237: Lex can also be used with a parser generator ! 238: to perform the lexical analysis phase; it is particularly ! 239: easy to interface Lex and Yacc [3]. ! 240: Lex programs recognize only regular expressions; ! 241: Yacc writes parsers that accept a large class of context free grammars, ! 242: but require a lower level analyzer to recognize input tokens. ! 243: Thus, a combination of Lex and Yacc is often appropriate. ! 244: When used as a preprocessor for a later parser generator, ! 245: Lex is used to partition the input stream, ! 246: and the parser generator assigns structure to ! 247: the resulting pieces. ! 248: The flow of control ! 249: in such a case (which might be the first half of a compiler, ! 250: for example) is shown in Figure 2. ! 251: Additional programs, ! 252: written by other generators ! 253: or by hand, can ! 254: be added easily to programs written by Lex. ! 255: .BS 2 ! 256: .ps 9 ! 257: .vs 11 ! 258: .TS ! 259: center; ! 260: l c c c l ! 261: l c c c l ! 262: l c c c l ! 263: l _ c _ l ! 264: l|c|c|c|l ! 265: l _ c _ l ! 266: l c c c l ! 267: l _ c _ l ! 268: l|c|c|c|l ! 269: l _ c _ l ! 270: l c s s l ! 271: l c s s l. ! 272: lexical grammar ! 273: rules rules ! 274: \(da \(da ! 275: ! 276: Lex Yacc ! 277: ! 278: \(da \(da ! 279: ! 280: Input \(-> yylex \(-> yyparse \(-> Parsed input ! 281: ! 282: .sp ! 283: Lex with Yacc ! 284: Figure 2 ! 285: .TE ! 286: .ps 10 ! 287: .vs 12 ! 288: .BE ! 289: Yacc users ! 290: will realize that the name ! 291: .ul ! 292: yylex ! 293: is what Yacc expects its lexical analyzer to be named, ! 294: so that the use of this name by Lex simplifies ! 295: interfacing. ! 296: .PP ! 297: Lex generates a deterministic finite automaton from the regular expressions ! 298: in the source [4]. ! 299: The automaton is interpreted, rather than compiled, in order ! 300: to save space. ! 301: The result is still a fast analyzer. ! 302: In particular, the time taken by a Lex program ! 303: to recognize and partition an input stream is ! 304: proportional to the length of the input. ! 305: The number of Lex rules or ! 306: the complexity of the rules is ! 307: not important in determining speed, ! 308: unless rules which include ! 309: forward context require a significant amount of re~scanning. ! 310: What does increase with the number and complexity of rules ! 311: is the size of the finite ! 312: automaton, and therefore the size of the program ! 313: generated by Lex. ! 314: .PP ! 315: In the program written by Lex, the user's fragments ! 316: (representing the ! 317: .ul ! 318: actions ! 319: to be performed as each regular expression ! 320: is found) ! 321: are gathered ! 322: as cases of a switch. ! 323: The automaton interpreter directs the control flow. ! 324: Opportunity is provided for the user to insert either ! 325: declarations or additional statements in the routine containing ! 326: the actions, or to ! 327: add subroutines outside this action routine. ! 328: .PP ! 329: Lex is not limited to source which can ! 330: be interpreted on the basis of one character ! 331: look~ahead. ! 332: For example, ! 333: if there are two rules, one looking for ! 334: .I ab ! 335: and another for ! 336: .I abcdefg , ! 337: and the input stream is ! 338: .I abcdefh , ! 339: Lex will recognize ! 340: .I ab ! 341: and leave ! 342: the input pointer just before ! 343: .I "cd. . ." ! 344: Such backup is more costly ! 345: than the processing of simpler languages. ! 346: .2C ! 347: .NH ! 348: Lex Source. ! 349: .PP ! 350: The general format of Lex source is: ! 351: .TS ! 352: center; ! 353: l. ! 354: {definitions} ! 355: %% ! 356: {rules} ! 357: %% ! 358: {user subroutines} ! 359: .TE ! 360: where the definitions and the user subroutines ! 361: are often omitted. ! 362: The second ! 363: .I %% ! 364: is optional, but the first is required ! 365: to mark the beginning of the rules. ! 366: The absolute minimum Lex program is thus ! 367: .TS ! 368: center; ! 369: l. ! 370: %% ! 371: .TE ! 372: (no definitions, no rules) which translates into a program ! 373: which copies the input to the output unchanged. ! 374: .PP ! 375: In the outline of Lex programs shown above, the ! 376: .I ! 377: rules ! 378: .R ! 379: represent the user's control ! 380: decisions; they are a table, in which the left column ! 381: contains ! 382: .I ! 383: regular expressions ! 384: .R ! 385: (see section 3) ! 386: and the right column contains ! 387: .I ! 388: actions, ! 389: .R ! 390: program fragments to be executed when the expressions ! 391: are recognized. ! 392: Thus an individual rule might appear ! 393: .TS ! 394: center; ! 395: l l. ! 396: integer printf("found keyword INT"); ! 397: .TE ! 398: to look for the string ! 399: .I integer ! 400: in the input stream and ! 401: print the message ``found keyword INT'' whenever it appears. ! 402: In this example the host procedural language is C and ! 403: the C library function ! 404: .I ! 405: printf ! 406: .R ! 407: is used to print the string. ! 408: The end ! 409: of the expression is indicated by the first blank or tab character. ! 410: If the action is merely a single C expression, ! 411: it can just be given on the right side of the line; if it is ! 412: compound, or takes more than a line, it should be enclosed in ! 413: braces. ! 414: As a slightly more useful example, suppose it is desired to ! 415: change a number of words from British to American spelling. ! 416: Lex rules such as ! 417: .TS ! 418: center; ! 419: l l. ! 420: colour printf("color"); ! 421: mechanise printf("mechanize"); ! 422: petrol printf("gas"); ! 423: .TE ! 424: would be a start. These rules are not quite enough, ! 425: since ! 426: the word ! 427: .I petroleum ! 428: would become ! 429: .I gaseum ; ! 430: a way of dealing ! 431: with this will be described later. ! 432: .2C ! 433: .NH ! 434: Lex Regular Expressions. ! 435: .PP ! 436: The definitions of regular expressions are very similar to those ! 437: in QED [5]. ! 438: A regular ! 439: expression specifies a set of strings to be matched. ! 440: It contains text characters (which match the corresponding ! 441: characters in the strings being compared) ! 442: and operator characters (which specify ! 443: repetitions, choices, and other features). ! 444: The letters of the alphabet and the digits are ! 445: always text characters; thus the regular expression ! 446: .TS ! 447: center; ! 448: l l. ! 449: integer ! 450: .TE ! 451: matches the string ! 452: .ul ! 453: integer ! 454: wherever it appears ! 455: and the expression ! 456: .TS ! 457: center; ! 458: l. ! 459: a57D ! 460: .TE ! 461: looks for the string ! 462: .ul ! 463: a57D. ! 464: .PP ! 465: .I ! 466: Operators. ! 467: .R ! 468: The operator characters are ! 469: .TS ! 470: center; ! 471: l. ! 472: " \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < > ! 473: .TE ! 474: and if they are to be used as text characters, an escape ! 475: should be used. ! 476: The quotation mark operator (") ! 477: indicates that whatever is contained between a pair of quotes ! 478: is to be taken as text characters. ! 479: Thus ! 480: .TS ! 481: center; ! 482: l. ! 483: xyz"++" ! 484: .TE ! 485: matches the string ! 486: .I xyz++ ! 487: when it appears. Note that a part of a string may be quoted. ! 488: It is harmless but unnecessary to quote an ordinary ! 489: text character; the expression ! 490: .TS ! 491: center; ! 492: l. ! 493: "xyz++" ! 494: .TE ! 495: is the same as the one above. ! 496: Thus by quoting every non-alphanumeric character ! 497: being used as a text character, the user can avoid remembering ! 498: the list above of current ! 499: operator characters, and is safe should further extensions to Lex ! 500: lengthen the list. ! 501: .PP ! 502: An operator character may also be turned into a text character ! 503: by preceding it with \e as in ! 504: .TS ! 505: center; ! 506: l. ! 507: xyz\e+\e+ ! 508: .TE ! 509: which ! 510: is another, less readable, equivalent of the above expressions. ! 511: Another use of the quoting mechanism is to get a blank into ! 512: an expression; normally, as explained above, blanks or tabs end ! 513: a rule. ! 514: Any blank character not contained within [\|] (see below) must ! 515: be quoted. ! 516: Several normal C escapes with \e ! 517: are recognized: \en is newline, \et is tab, and \eb is backspace. ! 518: To enter \e itself, use \e\e. ! 519: Since newline is illegal in an expression, \en must be used; ! 520: it is not ! 521: required to escape tab and backspace. ! 522: Every character but blank, tab, newline and the list above is always ! 523: a text character. ! 524: .PP ! 525: .I ! 526: Character classes. ! 527: .R ! 528: Classes of characters can be specified using the operator pair [\|]. ! 529: The construction ! 530: .I [abc] ! 531: matches a ! 532: single character, which may be ! 533: .I a , ! 534: .I b , ! 535: or ! 536: .I c . ! 537: Within square brackets, ! 538: most operator meanings are ignored. ! 539: Only three characters are special: ! 540: these are \e \(mi and ^. The \(mi character ! 541: indicates ranges. For example, ! 542: .TS ! 543: center; ! 544: l. ! 545: [a\(miz0\(mi9<>_] ! 546: .TE ! 547: indicates the character class containing all the lower case letters, ! 548: the digits, ! 549: the angle brackets, and underline. ! 550: Ranges may be given in either order. ! 551: Using \(mi between any pair of characters which are ! 552: not both upper case letters, both lower case letters, or both digits ! 553: is implementation dependent and will get a warning message. ! 554: (E.g., [0\-z] in ASCII is many more characters ! 555: than it is in EBCDIC). ! 556: If it is desired to include the ! 557: character \(mi in a character class, it should be first or ! 558: last; thus ! 559: .TS ! 560: center; ! 561: l. ! 562: [\(mi+0\(mi9] ! 563: .TE ! 564: matches all the digits and the two signs. ! 565: .PP ! 566: In character classes, ! 567: the ^ operator must appear as the first character ! 568: after the left bracket; it indicates that the resulting string ! 569: is to be complemented with respect to the computer character set. ! 570: Thus ! 571: .TS ! 572: center; ! 573: l. ! 574: [^abc] ! 575: .TE ! 576: matches all characters except a, b, or c, including ! 577: all special or control characters; or ! 578: .TS ! 579: center; ! 580: l. ! 581: [^a\-zA\-Z] ! 582: .TE ! 583: is any character which is not a letter. ! 584: The \e character provides the usual escapes within ! 585: character class brackets. ! 586: .PP ! 587: .I ! 588: Arbitrary character. ! 589: .R ! 590: To match almost any character, the operator character ! 591: .TS ! 592: center; ! 593: l. ! 594: \&. ! 595: .TE ! 596: is the class of all characters except newline. ! 597: Escaping into octal is possible although non-portable: ! 598: .TS ! 599: center; ! 600: l. ! 601: [\e40\-\e176] ! 602: .TE ! 603: matches all printable characters in the ASCII character set, from octal ! 604: 40 (blank) to octal 176 (tilde). ! 605: .PP ! 606: .I ! 607: Optional expressions. ! 608: .R ! 609: The operator ! 610: .I ? ! 611: indicates ! 612: an optional element of an expression. ! 613: Thus ! 614: .TS ! 615: center; ! 616: l. ! 617: ab?c ! 618: .TE ! 619: matches either ! 620: .I ac ! 621: or ! 622: .I abc . ! 623: .PP ! 624: .I ! 625: Repeated expressions. ! 626: .R ! 627: Repetitions of classes are indicated by the operators ! 628: .I \(** ! 629: and ! 630: .I + . ! 631: .TS ! 632: center; ! 633: l. ! 634: \f2a\(**\f1 ! 635: .TE ! 636: is any number of consecutive ! 637: .I a ! 638: characters, including zero; while ! 639: .TS ! 640: center; ! 641: l. ! 642: a+ ! 643: .TE ! 644: is one or more instances of ! 645: .I a. ! 646: For example, ! 647: .TS ! 648: center; ! 649: l. ! 650: [a\-z]+ ! 651: .TE ! 652: is all strings of lower case letters. ! 653: And ! 654: .TS ! 655: center; ! 656: l. ! 657: [A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(** ! 658: .TE ! 659: indicates all alphanumeric strings with a leading ! 660: alphabetic character. ! 661: This is a typical expression for recognizing identifiers in ! 662: computer languages. ! 663: .PP ! 664: .I ! 665: Alternation and Grouping. ! 666: .R ! 667: The operator | ! 668: indicates alternation: ! 669: .TS ! 670: center; ! 671: l. ! 672: (ab\||\|cd) ! 673: .TE ! 674: matches either ! 675: .ul ! 676: ab ! 677: or ! 678: .ul ! 679: cd. ! 680: Note that parentheses are used for grouping, although ! 681: they are ! 682: not necessary on the outside level; ! 683: .TS ! 684: center; ! 685: l. ! 686: ab\||\|cd ! 687: .TE ! 688: would have sufficed. ! 689: Parentheses ! 690: can be used for more complex expressions: ! 691: .TS ! 692: center; ! 693: l. ! 694: (ab\||\|cd+)?(ef)\(** ! 695: .TE ! 696: matches such strings as ! 697: .I abefef , ! 698: .I efefef , ! 699: .I cdef , ! 700: or ! 701: .I cddd\| ; ! 702: but not ! 703: .I abc , ! 704: .I abcd , ! 705: or ! 706: .I abcdef . ! 707: .PP ! 708: .I ! 709: Context sensitivity. ! 710: .R ! 711: Lex will recognize a small amount of surrounding ! 712: context. The two simplest operators for this are ! 713: .I ^ ! 714: and ! 715: .I $ . ! 716: If the first character of an expression is ! 717: .I ^ , ! 718: the expression will only be matched at the beginning ! 719: of a line (after a newline character, or at the beginning of ! 720: the input stream). ! 721: This can never conflict with the other meaning of ! 722: .I ^ , ! 723: complementation ! 724: of character classes, since that only applies within ! 725: the [\|] operators. ! 726: If the very last character is ! 727: .I $ , ! 728: the expression will only be matched at the end of a line (when ! 729: immediately followed by newline). ! 730: The latter operator is a special case of the ! 731: .I / ! 732: operator character, ! 733: which indicates trailing context. ! 734: The expression ! 735: .TS ! 736: center; ! 737: l. ! 738: ab/cd ! 739: .TE ! 740: matches the string ! 741: .I ab , ! 742: but only if followed by ! 743: .ul ! 744: cd. ! 745: Thus ! 746: .TS ! 747: center; ! 748: l. ! 749: ab$ ! 750: .TE ! 751: is the same as ! 752: .TS ! 753: center; ! 754: l. ! 755: ab/\en ! 756: .TE ! 757: Left context is handled in Lex by ! 758: .I ! 759: start conditions ! 760: .R ! 761: as explained in section 10. If a rule is only to be executed ! 762: when the Lex automaton interpreter is in start condition ! 763: .I ! 764: x, ! 765: .R ! 766: the rule should be prefixed by ! 767: .TS ! 768: center; ! 769: l. ! 770: <x> ! 771: .TE ! 772: using the angle bracket operator characters. ! 773: If we considered ``being at the beginning of a line'' to be ! 774: start condition ! 775: .I ONE , ! 776: then the ^ operator ! 777: would be equivalent to ! 778: .TS ! 779: center; ! 780: l. ! 781: <ONE> ! 782: .TE ! 783: Start conditions are explained more fully later. ! 784: .PP ! 785: .I ! 786: Repetitions and Definitions. ! 787: .R ! 788: The operators {} specify ! 789: either repetitions (if they enclose numbers) ! 790: or ! 791: definition expansion (if they enclose a name). For example ! 792: .TS ! 793: center; ! 794: l. ! 795: {digit} ! 796: .TE ! 797: looks for a predefined string named ! 798: .I digit ! 799: and inserts it ! 800: at that point in the expression. ! 801: The definitions are given in the first part of the Lex ! 802: input, before the rules. ! 803: In contrast, ! 804: .TS ! 805: center; ! 806: l. ! 807: a{1,5} ! 808: .TE ! 809: looks for 1 to 5 occurrences of ! 810: .I a . ! 811: .PP ! 812: Finally, initial ! 813: .I % ! 814: is special, being the separator ! 815: for Lex source segments. ! 816: .2C ! 817: .NH ! 818: Lex Actions. ! 819: .PP ! 820: When an expression written as above is matched, Lex ! 821: executes the corresponding action. This section describes ! 822: some features of Lex which aid in writing actions. Note ! 823: that there is a default action, which ! 824: consists of copying the input to the output. This ! 825: is performed on all strings not otherwise matched. Thus ! 826: the Lex user who wishes to absorb the entire input, without ! 827: producing any output, must provide rules to match everything. ! 828: When Lex is being used with Yacc, this is the normal ! 829: situation. ! 830: One may consider that actions are what is done instead of ! 831: copying the input to the output; thus, in general, ! 832: a rule which merely copies can be omitted. ! 833: Also, a character combination ! 834: which is omitted from the rules ! 835: and which appears as input ! 836: is likely to be printed on the output, thus calling ! 837: attention to the gap in the rules. ! 838: .PP ! 839: One of the simplest things that can be done is to ignore ! 840: the input. Specifying a C null statement, \fI;\fR as an action ! 841: causes this result. A frequent rule is ! 842: .TS ! 843: center; ! 844: l l. ! 845: [ \et\en] ; ! 846: .TE ! 847: which causes the three spacing characters (blank, tab, and newline) ! 848: to be ignored. ! 849: .PP ! 850: Another easy way to avoid writing actions is the action character ! 851: |, which indicates that the action for this rule is the action ! 852: for the next rule. ! 853: The previous example could also have been written ! 854: .TS ! 855: center; ! 856: l l. ! 857: " " | ! 858: "\et" | ! 859: "\en" ; ! 860: .TE ! 861: with the same result, although in different style. ! 862: The quotes around \en and \et are not required. ! 863: .PP ! 864: In more complex actions, the user ! 865: will ! 866: often want to know the actual text that matched some expression ! 867: like ! 868: .I [a\(miz]+ . ! 869: Lex leaves this text in an external character ! 870: array named ! 871: .I ! 872: yytext. ! 873: .R ! 874: Thus, to print the name found, ! 875: a rule like ! 876: .TS ! 877: center; ! 878: l l. ! 879: [a\-z]+ printf("%s", yytext); ! 880: .TE ! 881: will print ! 882: the string in ! 883: .I ! 884: yytext. ! 885: .R ! 886: The C function ! 887: .I ! 888: printf ! 889: .R ! 890: accepts a format argument and data to be printed; ! 891: in this case, the format is ``print string'' (% indicating ! 892: data conversion, and ! 893: .I s ! 894: indicating string type), ! 895: and the data are the characters ! 896: in ! 897: .I ! 898: yytext. ! 899: .R ! 900: So this just places ! 901: the matched string ! 902: on the output. ! 903: This action ! 904: is so common that ! 905: it may be written as ECHO: ! 906: .TS ! 907: center; ! 908: l l. ! 909: [a\-z]+ ECHO; ! 910: .TE ! 911: is the same as the above. ! 912: Since the default action is just to ! 913: print the characters found, one might ask why ! 914: give a rule, like this one, which merely specifies ! 915: the default action? ! 916: Such rules are often required ! 917: to avoid matching some other rule ! 918: which is not desired. For example, if there is a rule ! 919: which matches ! 920: .I read ! 921: it will normally match the instances of ! 922: .I read ! 923: contained in ! 924: .I bread ! 925: or ! 926: .I readjust ; ! 927: to avoid ! 928: this, ! 929: a rule ! 930: of the form ! 931: .I [a\(miz]+ ! 932: is needed. ! 933: This is explained further below. ! 934: .PP ! 935: Sometimes it is more convenient to know the end of what ! 936: has been found; hence Lex also provides a count ! 937: .I ! 938: yyleng ! 939: .R ! 940: of the number of characters matched. ! 941: To count both the number ! 942: of words and the number of characters in words in the input, the user might write ! 943: .TS ! 944: center; ! 945: l l. ! 946: [a\-zA\-Z]+ {words++; chars += yyleng;} ! 947: .TE ! 948: which accumulates in ! 949: .ul ! 950: chars ! 951: the number ! 952: of characters in the words recognized. ! 953: The last character in the string matched can ! 954: be accessed by ! 955: .TS ! 956: center; ! 957: l. ! 958: yytext[yyleng\-1] ! 959: .TE ! 960: .PP ! 961: Occasionally, a Lex ! 962: action may decide that a rule has not recognized the correct ! 963: span of characters. ! 964: Two routines are provided to aid with this situation. ! 965: First, ! 966: .I ! 967: yymore() ! 968: .R ! 969: can be called to indicate that the next input expression recognized is to be ! 970: tacked on to the end of this input. Normally, ! 971: the next input string would overwrite the current ! 972: entry in ! 973: .I ! 974: yytext. ! 975: .R ! 976: Second, ! 977: .I ! 978: yyless (n) ! 979: .R ! 980: may be called to indicate that not all the characters matched ! 981: by the currently successful expression are wanted right now. ! 982: The argument ! 983: .I ! 984: n ! 985: .R ! 986: indicates the number of characters ! 987: in ! 988: .I ! 989: yytext ! 990: .R ! 991: to be retained. ! 992: Further characters previously matched ! 993: are ! 994: returned to the input. This provides the same sort of ! 995: look~ahead offered by the / operator, ! 996: but in a different form. ! 997: .PP ! 998: .I ! 999: Example: ! 1000: .R ! 1001: Consider a language which defines ! 1002: a string as a set of characters between quotation (") marks, and provides that ! 1003: to include a " in a string it must be preceded by a \e. The ! 1004: regular expression which matches that is somewhat confusing, ! 1005: so that it might be preferable to write ! 1006: .TS ! 1007: center; ! 1008: l l. ! 1009: \e"[^"]\(** { ! 1010: if (yytext[yyleng\-1] == \(fm\e\e\(fm) ! 1011: yymore(); ! 1012: else ! 1013: ... normal user processing ! 1014: } ! 1015: .TE ! 1016: which will, when faced with a string such as ! 1017: .I ! 1018: "abc\e"def\|" ! 1019: .R ! 1020: first match ! 1021: the five characters ! 1022: \fI"abc\e\|\fR; ! 1023: then ! 1024: the call to ! 1025: .I yymore() ! 1026: will ! 1027: cause the next part of the string, ! 1028: \fI"def\|\fR, ! 1029: to be tacked on the end. ! 1030: Note that the final quote terminating the string should be picked ! 1031: up in the code labeled ``normal processing''. ! 1032: .PP ! 1033: The function ! 1034: .I ! 1035: yyless() ! 1036: .R ! 1037: might be used to reprocess ! 1038: text in various circumstances. Consider the C problem of distinguishing ! 1039: the ambiguity of ``=\(mia''. ! 1040: Suppose it is desired to treat this as ``=\(mi a'' ! 1041: but print a message. A rule might be ! 1042: .ps 9 ! 1043: .vs 11 ! 1044: .TS ! 1045: center; ! 1046: l l. ! 1047: =\(mi[a\-zA\-Z] { ! 1048: printf("Op (=\(mi) ambiguous\en"); ! 1049: yyless(yyleng\-1); ! 1050: ... action for =\(mi ... ! 1051: } ! 1052: .TE ! 1053: .ps 10 ! 1054: .vs 12 ! 1055: which prints a message, returns the letter after the ! 1056: operator to the input stream, and treats the operator as ``=\(mi''. ! 1057: Alternatively it might be desired to treat this as ``= \(mia''. ! 1058: To do this, just return the minus ! 1059: sign as well as the letter to the input: ! 1060: .ps 9 ! 1061: .vs 11 ! 1062: .TS ! 1063: center; ! 1064: l l. ! 1065: =\(mi[a\-zA\-Z] { ! 1066: printf("Op (=\(mi) ambiguous\en"); ! 1067: yyless(yyleng\-2); ! 1068: ... action for = ... ! 1069: } ! 1070: .TE ! 1071: .ps 10 ! 1072: .vs 12 ! 1073: will perform the other interpretation. ! 1074: Note that the expressions for the two cases might more easily ! 1075: be written ! 1076: .TS ! 1077: center; ! 1078: l l. ! 1079: =\(mi/[A\-Za\-z] ! 1080: .TE ! 1081: in the first case and ! 1082: .TS ! 1083: center; ! 1084: l. ! 1085: =/\-[A\-Za\-z] ! 1086: .TE ! 1087: in the second; ! 1088: no backup would be required in the rule action. ! 1089: It is not necessary to recognize the whole identifier ! 1090: to observe the ambiguity. ! 1091: The ! 1092: possibility of ``=\(mi3'', however, makes ! 1093: .TS ! 1094: center; ! 1095: l. ! 1096: =\(mi/[^ \et\en] ! 1097: .TE ! 1098: a still better rule. ! 1099: .PP ! 1100: In addition to these routines, Lex also permits ! 1101: access to the I/O routines ! 1102: it uses. ! 1103: They are: ! 1104: .IP 1) ! 1105: .I ! 1106: input() ! 1107: .R ! 1108: which returns the next input character; ! 1109: .IP 2) ! 1110: .I ! 1111: output(c) ! 1112: .R ! 1113: which writes the character ! 1114: .I ! 1115: c ! 1116: .R ! 1117: on the output; and ! 1118: .IP 3) ! 1119: .I ! 1120: unput(c) ! 1121: .R ! 1122: pushes the character ! 1123: .I ! 1124: c ! 1125: .R ! 1126: back onto the input stream to be read later by ! 1127: .I ! 1128: input(). ! 1129: .R ! 1130: .LP ! 1131: By default these routines are provided as macro definitions, ! 1132: but the user can override them and supply private versions. ! 1133: These routines ! 1134: define the relationship between external files and ! 1135: internal characters, and must all be retained ! 1136: or modified consistently. ! 1137: They may be redefined, to ! 1138: cause input or output to be transmitted to or from strange ! 1139: places, including other programs or internal memory; ! 1140: but the character set used must be consistent in all routines; ! 1141: a value of zero returned by ! 1142: .I ! 1143: input ! 1144: .R ! 1145: must mean end of file; and ! 1146: the relationship between ! 1147: .I ! 1148: unput ! 1149: .R ! 1150: and ! 1151: .I ! 1152: input ! 1153: .R ! 1154: must be retained ! 1155: or the Lex look~ahead will not work. ! 1156: Lex does not look ahead at all if it does not have to, ! 1157: but every rule ending in ! 1158: .ft I ! 1159: + \(** ? ! 1160: .ft R ! 1161: or ! 1162: .ft I ! 1163: $ ! 1164: .ft R ! 1165: or containing ! 1166: .ft I ! 1167: / ! 1168: .ft R ! 1169: implies look~ahead. ! 1170: Look~ahead is also necessary to match an expression that is a prefix ! 1171: of another expression. ! 1172: See below for a discussion of the character set used by Lex. ! 1173: The standard Lex library imposes ! 1174: a 100 character limit on backup. ! 1175: .PP ! 1176: Another Lex library routine that the user will sometimes want ! 1177: to redefine is ! 1178: .I ! 1179: yywrap() ! 1180: .R ! 1181: which is called whenever Lex reaches an end-of-file. ! 1182: If ! 1183: .I ! 1184: yywrap ! 1185: .R ! 1186: returns a 1, Lex continues with the normal wrapup on end of input. ! 1187: Sometimes, however, it is convenient to arrange for more ! 1188: input to arrive ! 1189: from a new source. ! 1190: In this case, the user should provide ! 1191: a ! 1192: .I ! 1193: yywrap ! 1194: .R ! 1195: which ! 1196: arranges for new input and ! 1197: returns 0. This instructs Lex to continue processing. ! 1198: The default ! 1199: .I ! 1200: yywrap ! 1201: .R ! 1202: always returns 1. ! 1203: .PP ! 1204: This routine is also a convenient place ! 1205: to print tables, summaries, etc. at the end ! 1206: of a program. Note that it is not ! 1207: possible to write a normal rule which recognizes ! 1208: end-of-file; the only access to this condition is ! 1209: through ! 1210: .I ! 1211: yywrap. ! 1212: .R ! 1213: In fact, unless a private version of ! 1214: .I ! 1215: input() ! 1216: .R ! 1217: is supplied ! 1218: a file containing nulls ! 1219: cannot be handled, ! 1220: since a value of 0 returned by ! 1221: .I ! 1222: input ! 1223: .R ! 1224: is taken to be end-of-file. ! 1225: .PP ! 1226: .2C ! 1227: .NH ! 1228: Ambiguous Source Rules. ! 1229: .PP ! 1230: Lex can handle ambiguous specifications. ! 1231: When more than one expression can match the ! 1232: current input, Lex chooses as follows: ! 1233: .IP 1) ! 1234: The longest match is preferred. ! 1235: .IP 2) ! 1236: Among rules which matched the same number of characters, ! 1237: the rule given first is preferred. ! 1238: .LP ! 1239: Thus, suppose the rules ! 1240: .TS ! 1241: center; ! 1242: l l. ! 1243: integer keyword action ...; ! 1244: [a\-z]+ identifier action ...; ! 1245: .TE ! 1246: to be given in that order. If the input is ! 1247: .I integers , ! 1248: it is taken as an identifier, because ! 1249: .I [a\-z]+ ! 1250: matches 8 characters while ! 1251: .I integer ! 1252: matches only 7. ! 1253: If the input is ! 1254: .I integer , ! 1255: both rules match 7 characters, and ! 1256: the keyword rule is selected because it was given first. ! 1257: Anything shorter (e.g. \fIint\fR\|) will ! 1258: not match the expression ! 1259: .I integer ! 1260: and so the identifier interpretation is used. ! 1261: .PP ! 1262: The principle of preferring the longest ! 1263: match makes rules containing ! 1264: expressions like ! 1265: .I \&.\(** ! 1266: dangerous. ! 1267: For example, ! 1268: .TS ! 1269: center; ! 1270: l. ! 1271: \&\(fm.\(**\(fm ! 1272: .TE ! 1273: might seem a good way of recognizing ! 1274: a string in single quotes. ! 1275: But it is an invitation for the program to read far ! 1276: ahead, looking for a distant ! 1277: single quote. ! 1278: Presented with the input ! 1279: .TS ! 1280: center; ! 1281: l l. ! 1282: \&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here ! 1283: .TE ! 1284: the above expression will match ! 1285: .TS ! 1286: center; ! 1287: l l. ! 1288: \&\(fmfirst\(fm quoted string here, \(fmsecond\(fm ! 1289: .TE ! 1290: which is probably not what was wanted. ! 1291: A better rule is of the form ! 1292: .TS ! 1293: center; ! 1294: l. ! 1295: \&\(fm[^\(fm\en]\(**\(fm ! 1296: .TE ! 1297: which, on the above input, will stop ! 1298: after ! 1299: .I \(fmfirst\(fm . ! 1300: The consequences ! 1301: of errors like this are mitigated by the fact ! 1302: that the ! 1303: .I \&. ! 1304: operator will not match newline. ! 1305: Thus expressions like ! 1306: .I \&.\(** ! 1307: stop on the ! 1308: current line. ! 1309: Don't try to defeat this with expressions like ! 1310: .I (.|\en)+ ! 1311: or ! 1312: equivalents; ! 1313: the Lex generated program will try to read ! 1314: the entire input file, causing ! 1315: internal buffer overflows. ! 1316: .PP ! 1317: Note that Lex is normally partitioning ! 1318: the input stream, not searching for all possible matches ! 1319: of each expression. ! 1320: This means that each character is accounted for ! 1321: once and only once. ! 1322: For example, suppose it is desired to ! 1323: count occurrences of both \fIshe\fR and \fIhe\fR in an input text. ! 1324: Some Lex rules to do this might be ! 1325: .TS ! 1326: center; ! 1327: l l. ! 1328: she s++; ! 1329: he h++; ! 1330: \en | ! 1331: \&. ; ! 1332: .TE ! 1333: where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR. ! 1334: Remember that . does not include newline. ! 1335: Since \fIshe\fR includes \fIhe\fR, Lex will normally ! 1336: .I ! 1337: not ! 1338: .R ! 1339: recognize ! 1340: the instances of \fIhe\fR included in \fIshe\fR, ! 1341: since once it has passed a \fIshe\fR those characters are gone. ! 1342: .PP ! 1343: Sometimes the user would like to override this choice. The action ! 1344: REJECT ! 1345: means ``go do the next alternative.'' ! 1346: It causes whatever rule was second choice after the current ! 1347: rule to be executed. ! 1348: The position of the input pointer is adjusted accordingly. ! 1349: Suppose the user really wants to count the included instances of \fIhe\fR: ! 1350: .TS ! 1351: center; ! 1352: l l. ! 1353: she {s++; REJECT;} ! 1354: he {h++; REJECT;} ! 1355: \en | ! 1356: \&. ; ! 1357: .TE ! 1358: these rules are one way of changing the previous example ! 1359: to do just that. ! 1360: After counting each expression, it is rejected; whenever appropriate, ! 1361: the other expression will then be counted. In this example, of course, ! 1362: the user could note that \fIshe\fR includes \fIhe\fR but not ! 1363: vice versa, and omit the REJECT action on \fIhe\fR; ! 1364: in other cases, however, it ! 1365: would not be possible a priori to tell ! 1366: which input characters ! 1367: were in both classes. ! 1368: .PP ! 1369: Consider the two rules ! 1370: .TS ! 1371: center; ! 1372: l l. ! 1373: a[bc]+ { ... ; REJECT;} ! 1374: a[cd]+ { ... ; REJECT;} ! 1375: .TE ! 1376: If the input is ! 1377: .I ab , ! 1378: only the first rule matches, ! 1379: and on ! 1380: .I ad ! 1381: only the second matches. ! 1382: The input string ! 1383: .I accb ! 1384: matches the first rule for four characters ! 1385: and then the second rule for three characters. ! 1386: In contrast, the input ! 1387: .I accd ! 1388: agrees with ! 1389: the second rule for four characters and then the first ! 1390: rule for three. ! 1391: .PP ! 1392: In general, REJECT is useful whenever ! 1393: the purpose of Lex is not to partition the input ! 1394: stream but to detect all examples of some items ! 1395: in the input, and the instances of these items ! 1396: may overlap or include each other. ! 1397: Suppose a digram table of the input is desired; ! 1398: normally the digrams overlap, that is the word ! 1399: .I the ! 1400: is considered to contain ! 1401: both ! 1402: .I th ! 1403: and ! 1404: .I he . ! 1405: Assuming a two-dimensional array named ! 1406: .ul ! 1407: digram ! 1408: to be incremented, the appropriate ! 1409: source is ! 1410: .TS ! 1411: center; ! 1412: l l. ! 1413: %% ! 1414: [a\-z][a\-z] { ! 1415: digram[yytext[0]][yytext[1]]++; ! 1416: REJECT; ! 1417: } ! 1418: \. ; ! 1419: \en ; ! 1420: .TE ! 1421: where the REJECT is necessary to pick up ! 1422: a letter pair beginning at every character, rather than at every ! 1423: other character. ! 1424: .2C ! 1425: .NH ! 1426: Lex Source Definitions. ! 1427: .PP ! 1428: Remember the format of the Lex ! 1429: source: ! 1430: .TS ! 1431: center; ! 1432: l. ! 1433: {definitions} ! 1434: %% ! 1435: {rules} ! 1436: %% ! 1437: {user routines} ! 1438: .TE ! 1439: So far only the rules have been described. The user needs ! 1440: additional options, ! 1441: though, to define variables for use in his program and for use ! 1442: by Lex. ! 1443: These can go either in the definitions section ! 1444: or in the rules section. ! 1445: .PP ! 1446: Remember that Lex is turning the rules into a program. ! 1447: Any source not intercepted by Lex is copied ! 1448: into the generated program. There are three classes ! 1449: of such things. ! 1450: .IP 1) ! 1451: Any line which is not part of a Lex rule or action ! 1452: which begins with a blank or tab is copied into ! 1453: the Lex generated program. ! 1454: Such source input prior to the first %% delimiter will be external ! 1455: to any function in the code; if it appears immediately after the first ! 1456: %%, ! 1457: it appears in an appropriate place for declarations ! 1458: in the function written by Lex which contains the actions. ! 1459: This material must look like program fragments, ! 1460: and should precede the first Lex rule. ! 1461: .IP ! 1462: As a side effect of the above, lines which begin with a blank ! 1463: or tab, and which contain a comment, ! 1464: are passed through to the generated program. ! 1465: This can be used to include comments in either the Lex source or ! 1466: the generated code. The comments should follow the host ! 1467: language convention. ! 1468: .IP 2) ! 1469: Anything included between lines containing ! 1470: only ! 1471: .I %{ ! 1472: and ! 1473: .I %} ! 1474: is ! 1475: copied out as above. The delimiters are discarded. ! 1476: This format permits entering text like preprocessor statements that ! 1477: must begin in column 1, ! 1478: or copying lines that do not look like programs. ! 1479: .IP 3) ! 1480: Anything after the third %% delimiter, regardless of formats, etc., ! 1481: is copied out after the Lex output. ! 1482: .PP ! 1483: Definitions intended for Lex are given ! 1484: before the first %% delimiter. Any line in this section ! 1485: not contained between %{ and %}, and begining ! 1486: in column 1, is assumed to define Lex substitution strings. ! 1487: The format of such lines is ! 1488: .TS ! 1489: center; ! 1490: l l. ! 1491: name translation ! 1492: .TE ! 1493: and it ! 1494: causes the string given as a translation to ! 1495: be associated with the name. ! 1496: The name and translation ! 1497: must be separated by at least one blank or tab, and the name must begin with a letter. ! 1498: The translation can then be called out ! 1499: by the {name} syntax in a rule. ! 1500: Using {D} for the digits and {E} for an exponent field, ! 1501: for example, might abbreviate rules to recognize numbers: ! 1502: .TS ! 1503: center; ! 1504: l l. ! 1505: D [0\-9] ! 1506: E [DEde][\-+]?{D}+ ! 1507: %% ! 1508: {D}+ printf("integer"); ! 1509: {D}+"."{D}\(**({E})? | ! 1510: {D}\(**"."{D}+({E})? | ! 1511: {D}+{E} printf("real"); ! 1512: .TE ! 1513: Note the first two rules for real numbers; ! 1514: both require a decimal point and contain ! 1515: an optional exponent field, ! 1516: but the first requires at least one digit before the ! 1517: decimal point and the second requires at least one ! 1518: digit after the decimal point. ! 1519: To correctly handle the problem ! 1520: posed by a Fortran expression such as ! 1521: .I 35.EQ.I , ! 1522: which does not contain a real number, a context-sensitive ! 1523: rule such as ! 1524: .TS ! 1525: center; ! 1526: l l. ! 1527: [0\-9]+/"."EQ printf("integer"); ! 1528: .TE ! 1529: could be used in addition to the normal rule for integers. ! 1530: .PP ! 1531: The definitions ! 1532: section may also contain other commands, including the ! 1533: selection of a host language, a character set table, ! 1534: a list of start conditions, or adjustments to the default ! 1535: size of arrays within Lex itself for larger source programs. ! 1536: These possibilities ! 1537: are discussed below under ``Summary of Source Format,'' ! 1538: section 12. ! 1539: .2C ! 1540: .NH ! 1541: Usage. ! 1542: .PP ! 1543: There are two steps in ! 1544: compiling a Lex source program. ! 1545: First, the Lex source must be turned into a generated program ! 1546: in the host general purpose language. ! 1547: Then this program must be compiled and loaded, usually with ! 1548: a library of Lex subroutines. ! 1549: The generated program ! 1550: is on a file named ! 1551: .I lex.yy.c . ! 1552: The I/O library is defined in terms of the C standard ! 1553: library [6]. ! 1554: .PP ! 1555: The C programs generated by Lex are slightly different ! 1556: on OS/370, because the ! 1557: OS compiler is less powerful than the UNIX or GCOS compilers, ! 1558: and does less at compile time. ! 1559: C programs generated on GCOS and UNIX are the same. ! 1560: .PP ! 1561: .I ! 1562: UNIX. ! 1563: .R ! 1564: The library is accessed by the loader flag ! 1565: .I \-ll . ! 1566: So an appropriate ! 1567: set of commands is ! 1568: .KS ! 1569: .in 5 ! 1570: lex source ! 1571: cc lex.yy.c \-ll ! 1572: .in 0 ! 1573: .KE ! 1574: The resulting program is placed on the usual file ! 1575: .I ! 1576: a.out ! 1577: .R ! 1578: for later execution. ! 1579: To use Lex with Yacc see below. ! 1580: Although the default Lex I/O routines use the C standard library, ! 1581: the Lex automata themselves do not do so; ! 1582: if private versions of ! 1583: .I ! 1584: input, ! 1585: output ! 1586: .R ! 1587: and ! 1588: .I unput ! 1589: are given, the library can be avoided. ! 1590: .PP ! 1591: .2C ! 1592: .NH ! 1593: Lex and Yacc. ! 1594: .PP ! 1595: If you want to use Lex with Yacc, note that what Lex writes is a program ! 1596: named ! 1597: .I ! 1598: yylex(), ! 1599: .R ! 1600: the name required by Yacc for its analyzer. ! 1601: Normally, the default main program on the Lex library ! 1602: calls this routine, but if Yacc is loaded, and its main ! 1603: program is used, Yacc will call ! 1604: .I ! 1605: yylex(). ! 1606: .R ! 1607: In this case each Lex rule should end with ! 1608: .TS ! 1609: center; ! 1610: l. ! 1611: return(token); ! 1612: .TE ! 1613: where the appropriate token value is returned. ! 1614: An easy way to get access ! 1615: to Yacc's names for tokens is to ! 1616: compile the Lex output file as part of ! 1617: the Yacc output file by placing the line ! 1618: .TS ! 1619: center; ! 1620: l. ! 1621: # include "lex.yy.c" ! 1622: .TE ! 1623: in the last section of Yacc input. ! 1624: Supposing the grammar to be ! 1625: named ``good'' and the lexical rules to be named ``better'' ! 1626: the UNIX command sequence can just be: ! 1627: .TS ! 1628: center; ! 1629: l. ! 1630: yacc good ! 1631: lex better ! 1632: cc y.tab.c \-ly \-ll ! 1633: .TE ! 1634: The Yacc library (\-ly) should be loaded before the Lex library, ! 1635: to obtain a main program which invokes the Yacc parser. ! 1636: The generations of Lex and Yacc programs can be done in ! 1637: either order. ! 1638: .2C ! 1639: .NH ! 1640: Examples. ! 1641: .PP ! 1642: As a trivial problem, consider copying an input file while ! 1643: adding 3 to every positive number divisible by 7. ! 1644: Here is a suitable Lex source program ! 1645: .TS ! 1646: center; ! 1647: l l. ! 1648: %% ! 1649: int k; ! 1650: [0\-9]+ { ! 1651: k = atoi(yytext); ! 1652: if (k%7 == 0) ! 1653: printf("%d", k+3); ! 1654: else ! 1655: printf("%d",k); ! 1656: } ! 1657: .TE ! 1658: to do just that. ! 1659: The rule [0\-9]+ recognizes strings of digits; ! 1660: .I ! 1661: atoi ! 1662: .R ! 1663: converts the digits to binary ! 1664: and stores the result in ! 1665: .ul ! 1666: k. ! 1667: The operator % (remainder) is used to check whether ! 1668: .ul ! 1669: k ! 1670: is divisible by 7; if it is, ! 1671: it is incremented by 3 as it is written out. ! 1672: It may be objected that this program will alter such ! 1673: input items as ! 1674: .I 49.63 ! 1675: or ! 1676: .I X7 . ! 1677: Furthermore, it increments the absolute value ! 1678: of all negative numbers divisible by 7. ! 1679: To avoid this, just add a few more rules after the active one, ! 1680: as here: ! 1681: .TS ! 1682: center; ! 1683: l l. ! 1684: %% ! 1685: int k; ! 1686: \-?[0\-9]+ { ! 1687: k = atoi(yytext); ! 1688: printf("%d", ! 1689: k%7 == 0 ? k+3 : k); ! 1690: } ! 1691: \-?[0\-9.]+ ECHO; ! 1692: [A-Za-z][A-Za-z0-9]+ ECHO; ! 1693: .TE ! 1694: Numerical strings containing ! 1695: a ``.'' or preceded by a letter will be picked up by ! 1696: one of the last two rules, and not changed. ! 1697: The ! 1698: .I if\-else ! 1699: has been replaced by ! 1700: a C conditional expression to save space; ! 1701: the form ! 1702: .ul ! 1703: a?b:c ! 1704: means ``if ! 1705: .I a ! 1706: then ! 1707: .I b ! 1708: else ! 1709: .I c ''. ! 1710: .PP ! 1711: For an example of statistics gathering, here ! 1712: is a program which histograms the lengths ! 1713: of words, where a word is defined as a string of letters. ! 1714: .TS ! 1715: center; ! 1716: l l. ! 1717: int lengs[100]; ! 1718: %% ! 1719: [a\-z]+ lengs[yyleng]++; ! 1720: \&. | ! 1721: \en ; ! 1722: %% ! 1723: .T& ! 1724: l s. ! 1725: yywrap() ! 1726: { ! 1727: int i; ! 1728: printf("Length No. words\en"); ! 1729: for(i=0; i<100; i++) ! 1730: if (lengs[i] > 0) ! 1731: printf("%5d%10d\en",i,lengs[i]); ! 1732: return(1); ! 1733: } ! 1734: .TE ! 1735: This program ! 1736: accumulates the histogram, while producing no output. At the end ! 1737: of the input it prints the table. ! 1738: The final statement ! 1739: .I ! 1740: return(1); ! 1741: .R ! 1742: indicates that Lex is to perform wrapup. If ! 1743: .I ! 1744: yywrap ! 1745: .R ! 1746: returns zero (false) ! 1747: it implies that further input is available ! 1748: and the program is ! 1749: to continue reading and processing. ! 1750: To provide a ! 1751: .I ! 1752: yywrap ! 1753: .R ! 1754: that never ! 1755: returns true causes an infinite loop. ! 1756: .PP ! 1757: As a larger example, ! 1758: here are some parts of a program written by N. L. Schryer ! 1759: to convert double precision Fortran to single precision Fortran. ! 1760: Because Fortran does not distinguish upper and lower case letters, ! 1761: this routine begins by defining a set of classes including ! 1762: both cases of each letter: ! 1763: .TS ! 1764: center; ! 1765: l l. ! 1766: a [aA] ! 1767: b [bB] ! 1768: c [cC] ! 1769: \&... ! 1770: z [zZ] ! 1771: .TE ! 1772: An additional class recognizes white space: ! 1773: .TS ! 1774: center; ! 1775: l l. ! 1776: W [ \et]\(** ! 1777: .TE ! 1778: The first rule changes ! 1779: ``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''. ! 1780: .TS ! 1781: center; ! 1782: l. ! 1783: {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} { ! 1784: printf(yytext[0]==\(fmd\(fm? "real" : "REAL"); ! 1785: } ! 1786: .TE ! 1787: Care is taken throughout this program to preserve the case ! 1788: (upper or lower) ! 1789: of the original program. ! 1790: The conditional operator is used to ! 1791: select the proper form of the keyword. ! 1792: The next rule copies continuation card indications to ! 1793: avoid confusing them with constants: ! 1794: .TS ! 1795: center; ! 1796: l l. ! 1797: ^" "[^ 0] ECHO; ! 1798: .TE ! 1799: In the regular expression, the quotes surround the ! 1800: blanks. ! 1801: It is interpreted as ! 1802: ``beginning of line, then five blanks, then ! 1803: anything but blank or zero.'' ! 1804: Note the two different meanings of ! 1805: .I ^ . ! 1806: There follow some rules to change double precision ! 1807: constants to ordinary floating constants. ! 1808: .TS ! 1809: center; ! 1810: l. ! 1811: [0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ | ! 1812: [0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+ | ! 1813: "."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ { ! 1814: /\(** convert constants \(**/ ! 1815: for(p=yytext; \(**p != 0; p++) ! 1816: { ! 1817: if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm) ! 1818: \(**p=+ \(fme\(fm\- \(fmd\(fm; ! 1819: ECHO; ! 1820: } ! 1821: .TE ! 1822: After the floating point constant is recognized, it is ! 1823: scanned by the ! 1824: .ul ! 1825: for ! 1826: loop ! 1827: to find the letter ! 1828: .I d ! 1829: or ! 1830: .I D . ! 1831: The program than adds ! 1832: .c ! 1833: .I \(fme\(fm\-\(fmd\(fm , ! 1834: which converts ! 1835: it to the next letter of the alphabet. ! 1836: The modified constant, now single-precision, ! 1837: is written out again. ! 1838: There follow a series of names which must be respelled to remove ! 1839: their initial \fId\fR. ! 1840: By using the ! 1841: array ! 1842: .I ! 1843: yytext ! 1844: .R ! 1845: the same action suffices for all the names (only a sample of ! 1846: a rather long list is given here). ! 1847: .TS ! 1848: center; ! 1849: l l. ! 1850: {d}{s}{i}{n} | ! 1851: {d}{c}{o}{s} | ! 1852: {d}{s}{q}{r}{t} | ! 1853: {d}{a}{t}{a}{n} | ! 1854: \&... ! 1855: {d}{f}{l}{o}{a}{t} printf("%s",yytext+1); ! 1856: .TE ! 1857: Another list of names must have initial \fId\fR changed to initial \fIa\fR: ! 1858: .TS ! 1859: center; ! 1860: l l. ! 1861: {d}{l}{o}{g} | ! 1862: {d}{l}{o}{g}10 | ! 1863: {d}{m}{i}{n}1 | ! 1864: {d}{m}{a}{x}1 { ! 1865: yytext[0] =+ \(fma\(fm \- \(fmd\(fm; ! 1866: ECHO; ! 1867: } ! 1868: .TE ! 1869: And one routine ! 1870: must have initial \fId\fR changed to initial \fIr\fR: ! 1871: .TS ! 1872: center; ! 1873: l l. ! 1874: {d}1{m}{a}{c}{h} {yytext[0] =+ \(fmr\(fm \- \(fmd\(fm; ! 1875: ECHO; ! 1876: } ! 1877: .TE ! 1878: To avoid such names as \fIdsinx\fR being detected as instances ! 1879: of \fIdsin\fR, some final rules pick up longer words as identifiers ! 1880: and copy some surviving characters: ! 1881: .TS ! 1882: center; ! 1883: l l. ! 1884: [A\-Za\-z][A\-Za\-z0\-9]\(** | ! 1885: [0\-9]+ | ! 1886: \en | ! 1887: \&. ECHO; ! 1888: .TE ! 1889: Note that this program is not complete; it ! 1890: does not deal with the spacing problems in Fortran or ! 1891: with the use of keywords as identifiers. ! 1892: .br ! 1893: .2C ! 1894: .NH ! 1895: Left Context Sensitivity. ! 1896: .PP ! 1897: Sometimes ! 1898: it is desirable to have several sets of lexical rules ! 1899: to be applied at different times in the input. ! 1900: For example, a compiler preprocessor might distinguish ! 1901: preprocessor statements and analyze them differently ! 1902: from ordinary statements. ! 1903: This requires ! 1904: sensitivity ! 1905: to prior context, and there are several ways of handling ! 1906: such problems. ! 1907: The \fI^\fR operator, for example, is a prior context operator, ! 1908: recognizing immediately preceding left context just as \fI$\fR recognizes ! 1909: immediately following right context. ! 1910: Adjacent left context could be extended, to produce a facility similar to ! 1911: that for adjacent right context, but it is unlikely ! 1912: to be as useful, since often the relevant left context ! 1913: appeared some time earlier, such as at the beginning of a line. ! 1914: .PP ! 1915: This section describes three means of dealing ! 1916: with different environments: a simple use of flags, ! 1917: when only a few rules change from one environment to another, ! 1918: the use of ! 1919: .I ! 1920: start conditions ! 1921: .R ! 1922: on rules, ! 1923: and the possibility of making multiple lexical analyzers all run ! 1924: together. ! 1925: In each case, there are rules which recognize the need to change the ! 1926: environment in which the ! 1927: following input text is analyzed, and set some parameter ! 1928: to reflect the change. This may be a flag explicitly tested by ! 1929: the user's action code; such a flag is the simplest way of dealing ! 1930: with the problem, since Lex is not involved at all. ! 1931: It may be more convenient, ! 1932: however, ! 1933: to have Lex remember the flags as initial conditions on the rules. ! 1934: Any rule may be associated with a start condition. It will only ! 1935: be recognized when Lex is in ! 1936: that start condition. ! 1937: The current start condition may be changed at any time. ! 1938: Finally, if the sets of rules for the different environments ! 1939: are very dissimilar, ! 1940: clarity may be best achieved by writing several distinct lexical ! 1941: analyzers, and switching from one to another as desired. ! 1942: .PP ! 1943: Consider the following problem: copy the input to the output, ! 1944: changing the word \fImagic\fR to \fIfirst\fR on every line which began ! 1945: with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line ! 1946: which began with the letter \fIb\fR, and changing ! 1947: \fImagic\fR to \fIthird\fR on every line which began ! 1948: with the letter \fIc\fR. All other words and all other lines ! 1949: are left unchanged. ! 1950: .PP ! 1951: These rules are so simple that the easiest way ! 1952: to do this job is with a flag: ! 1953: .TS ! 1954: center; ! 1955: l l. ! 1956: int flag; ! 1957: %% ! 1958: ^a {flag = \(fma\(fm; ECHO;} ! 1959: ^b {flag = \(fmb\(fm; ECHO;} ! 1960: ^c {flag = \(fmc\(fm; ECHO;} ! 1961: \en {flag = 0 ; ECHO;} ! 1962: magic { ! 1963: switch (flag) ! 1964: { ! 1965: case \(fma\(fm: printf("first"); break; ! 1966: case \(fmb\(fm: printf("second"); break; ! 1967: case \(fmc\(fm: printf("third"); break; ! 1968: default: ECHO; break; ! 1969: } ! 1970: } ! 1971: .TE ! 1972: should be adequate. ! 1973: .PP ! 1974: To handle the same problem with start conditions, each ! 1975: start condition must be introduced to Lex in the definitions section ! 1976: with a line reading ! 1977: .TS ! 1978: center; ! 1979: l l. ! 1980: %Start name1 name2 ... ! 1981: .TE ! 1982: where the conditions may be named in any order. ! 1983: The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR. ! 1984: The conditions may be referenced at the ! 1985: head of a rule with the <> brackets: ! 1986: .TS ! 1987: center; ! 1988: l. ! 1989: <name1>expression ! 1990: .TE ! 1991: is a rule which is only recognized when Lex is in the ! 1992: start condition \fIname1\fR. ! 1993: To enter a start condition, ! 1994: execute the action statement ! 1995: .TS ! 1996: center; ! 1997: l. ! 1998: BEGIN name1; ! 1999: .TE ! 2000: which changes the start condition to \fIname1\fR. ! 2001: To resume the normal state, ! 2002: .TS ! 2003: center; ! 2004: l. ! 2005: BEGIN 0; ! 2006: .TE ! 2007: resets the initial condition ! 2008: of the Lex automaton interpreter. ! 2009: A rule may be active in several ! 2010: start conditions: ! 2011: .TS ! 2012: center; ! 2013: l. ! 2014: <name1,name2,name3> ! 2015: .TE ! 2016: is a legal prefix. Any rule not beginning with the ! 2017: <> prefix operator is always active. ! 2018: .PP ! 2019: The same example as before can be written: ! 2020: .TS ! 2021: center; ! 2022: l l. ! 2023: %START AA BB CC ! 2024: %% ! 2025: ^a {ECHO; BEGIN AA;} ! 2026: ^b {ECHO; BEGIN BB;} ! 2027: ^c {ECHO; BEGIN CC;} ! 2028: \en {ECHO; BEGIN 0;} ! 2029: <AA>magic printf("first"); ! 2030: <BB>magic printf("second"); ! 2031: <CC>magic printf("third"); ! 2032: .TE ! 2033: where the logic is exactly the same as in the previous ! 2034: method of handling the problem, but Lex does the work ! 2035: rather than the user's code. ! 2036: .2C ! 2037: .NH ! 2038: Character Set. ! 2039: .PP ! 2040: The programs generated by Lex handle ! 2041: character I/O only through the routines ! 2042: .I ! 2043: input, ! 2044: output, ! 2045: .R ! 2046: and ! 2047: .I ! 2048: unput. ! 2049: .R ! 2050: Thus the character representation ! 2051: provided in these routines ! 2052: is accepted by Lex and employed to return ! 2053: values in ! 2054: .I ! 2055: yytext. ! 2056: .R ! 2057: For internal use ! 2058: a character is represented as a small integer ! 2059: which, if the standard library is used, ! 2060: has a value equal to the integer value of the bit ! 2061: pattern representing the character on the host computer. ! 2062: Normally, the letter ! 2063: .I a ! 2064: is represented as the same form as the character constant ! 2065: .I \(fma\(fm . ! 2066: If this interpretation is changed, by providing I/O ! 2067: routines which translate the characters, ! 2068: Lex must be told about ! 2069: it, by giving a translation table. ! 2070: This table must be in the definitions section, ! 2071: and must be bracketed by lines containing only ! 2072: ``%T''. ! 2073: The table contains lines of the form ! 2074: .TS ! 2075: center; ! 2076: l. ! 2077: {integer} {character string} ! 2078: .TE ! 2079: which indicate the value associated with each character. ! 2080: Thus the next example ! 2081: .GS 2 ! 2082: .TS ! 2083: center; ! 2084: l l. ! 2085: %T ! 2086: 1 Aa ! 2087: 2 Bb ! 2088: \&... ! 2089: 26 Zz ! 2090: 27 \en ! 2091: 28 + ! 2092: 29 \- ! 2093: 30 0 ! 2094: 31 1 ! 2095: \&... ! 2096: 39 9 ! 2097: %T ! 2098: .TE ! 2099: .sp ! 2100: .ce 1 ! 2101: Sample character table. ! 2102: .GE ! 2103: maps the lower and upper case letters together into the integers 1 through 26, ! 2104: newline into 27, + and \- into 28 and 29, and the ! 2105: digits into 30 through 39. ! 2106: Note the escape for newline. ! 2107: If a table is supplied, every character that is to appear either ! 2108: in the rules or in any valid input must be included ! 2109: in the table. ! 2110: No character ! 2111: may be assigned the number 0, and no character may be ! 2112: assigned a bigger number than the size of the hardware character set. ! 2113: .2C ! 2114: .NH ! 2115: Summary of Source Format. ! 2116: .PP ! 2117: The general form of a Lex source file is: ! 2118: .TS ! 2119: center; ! 2120: l. ! 2121: {definitions} ! 2122: %% ! 2123: {rules} ! 2124: %% ! 2125: {user subroutines} ! 2126: .TE ! 2127: The definitions section contains ! 2128: a combination of ! 2129: .IP 1) ! 2130: Definitions, in the form ``name space translation''. ! 2131: .IP 2) ! 2132: Included code, in the form ``space code''. ! 2133: .IP 3) ! 2134: Included code, in the form ! 2135: .TS ! 2136: center; ! 2137: l. ! 2138: %{ ! 2139: code ! 2140: %} ! 2141: .TE ! 2142: .ns ! 2143: .IP 4) ! 2144: Start conditions, given in the form ! 2145: .TS ! 2146: center; ! 2147: l. ! 2148: %S name1 name2 ... ! 2149: .TE ! 2150: .ns ! 2151: .IP 5) ! 2152: Character set tables, in the form ! 2153: .TS ! 2154: center; ! 2155: l. ! 2156: %T ! 2157: number space character-string ! 2158: \&... ! 2159: %T ! 2160: .TE ! 2161: .ns ! 2162: .IP 6) ! 2163: Changes to internal array sizes, in the form ! 2164: .TS ! 2165: center; ! 2166: l. ! 2167: %\fIx\fR\0\0\fInnn\fR ! 2168: .TE ! 2169: where \fInnn\fR is a decimal integer representing an array size ! 2170: and \fIx\fR selects the parameter as follows: ! 2171: .TS ! 2172: center; ! 2173: c c ! 2174: c l. ! 2175: Letter Parameter ! 2176: p positions ! 2177: n states ! 2178: e tree nodes ! 2179: a transitions ! 2180: k packed character classes ! 2181: o output array size ! 2182: .TE ! 2183: .LP ! 2184: Lines in the rules section have the form ``expression action'' ! 2185: where the action may be continued on succeeding ! 2186: lines by using braces to delimit it. ! 2187: .PP ! 2188: Regular expressions in Lex use the following ! 2189: operators: ! 2190: .br ! 2191: .TS ! 2192: center; ! 2193: l l. ! 2194: x the character "x" ! 2195: "x" an "x", even if x is an operator. ! 2196: \ex an "x", even if x is an operator. ! 2197: [xy] the character x or y. ! 2198: [x\-z] the characters x, y or z. ! 2199: [^x] any character but x. ! 2200: \&. any character but newline. ! 2201: ^x an x at the beginning of a line. ! 2202: <y>x an x when Lex is in start condition y. ! 2203: x$ an x at the end of a line. ! 2204: x? an optional x. ! 2205: x\(** 0,1,2, ... instances of x. ! 2206: x+ 1,2,3, ... instances of x. ! 2207: x|y an x or a y. ! 2208: (x) an x. ! 2209: x/y an x but only if followed by y. ! 2210: {xx} the translation of xx from the ! 2211: definitions section. ! 2212: x{m,n} \fIm\fR through \fIn\fR occurrences of x ! 2213: .TE ! 2214: .NH ! 2215: Caveats and Bugs. ! 2216: .PP ! 2217: There are pathological expressions which ! 2218: produce exponential growth of the tables when ! 2219: converted to deterministic machines; ! 2220: fortunately, they are rare. ! 2221: .PP ! 2222: REJECT does not rescan the input; instead it remembers the results of the previous ! 2223: scan. This means that if a rule with trailing context is found, and ! 2224: REJECT executed, the user ! 2225: must not have used ! 2226: .ul ! 2227: unput ! 2228: to change the characters forthcoming ! 2229: from the input stream. ! 2230: This is the only restriction on the user's ability to manipulate ! 2231: the not-yet-processed input. ! 2232: .PP ! 2233: .2C ! 2234: .NH ! 2235: Acknowledgments. ! 2236: .PP ! 2237: As should ! 2238: be obvious from the above, the outside of Lex ! 2239: is patterned ! 2240: on Yacc and the inside on Aho's string matching routines. ! 2241: Therefore, both S. C. Johnson and A. V. Aho ! 2242: are really originators ! 2243: of much of Lex, ! 2244: as well as debuggers of it. ! 2245: Many thanks are due to both. ! 2246: .PP ! 2247: The code of the current version of Lex was designed, written, ! 2248: and debugged by Eric Schmidt. ! 2249: .SG MH-1274-MEL-unix ! 2250: .sp 1 ! 2251: .2C ! 2252: .NH ! 2253: References. ! 2254: .SP 1v ! 2255: .IP 1. ! 2256: B. W. Kernighan and D. M. Ritchie, ! 2257: .I ! 2258: The C Programming Language, ! 2259: .R ! 2260: Prentice-Hall, N. J. (1978). ! 2261: .IP 2. ! 2262: B. W. Kernighan, ! 2263: .I ! 2264: Ratfor: A Preprocessor for a Rational Fortran, ! 2265: .R ! 2266: Software \- Practice and Experience, ! 2267: \fB5\fR, pp. 395-496 (1975). ! 2268: .IP 3. ! 2269: S. C. Johnson, ! 2270: .I ! 2271: Yacc: Yet Another Compiler Compiler, ! 2272: .R ! 2273: Computing Science Technical Report No. 32, ! 2274: 1975, ! 2275: .MH ! 2276: .if \n(tm (also TM 75-1273-6) ! 2277: .IP 4. ! 2278: A. V. Aho and M. J. Corasick, ! 2279: .I ! 2280: Efficient String Matching: An Aid to Bibliographic Search, ! 2281: .R ! 2282: Comm. ACM ! 2283: .B ! 2284: 18, ! 2285: .R ! 2286: 333-340 (1975). ! 2287: .IP 5. ! 2288: B. W. Kernighan, D. M. Ritchie and K. L. Thompson, ! 2289: .I ! 2290: QED Text Editor, ! 2291: .R ! 2292: Computing Science Technical Report No. 5, ! 2293: 1972, ! 2294: .MH ! 2295: .IP 6. ! 2296: D. M. Ritchie, ! 2297: private communication. ! 2298: See also ! 2299: M. E. Lesk, ! 2300: .I ! 2301: The Portable C Library, ! 2302: .R ! 2303: Computing Science Technical Report No. 31, ! 2304: .MH ! 2305: .if \n(tm (also TM 75-1274-11)
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.