Annotation of 43BSDReno/share/doc/ps1/16.lex/lex.ms, revision 1.1

1.1     ! root        1: .\"    @(#)lex.ms      6.2 (Berkeley) 5/1/87
        !             2: .\"
        !             3: .EH 'PS1:16-%''Lex \- A Lexical Analyzer Generator'
        !             4: .OH 'Lex \- A Lexical Analyzer Generator''PS1:16-%'
        !             5: .hc ~
        !             6: .bd I 2
        !             7: .de TS
        !             8: .br
        !             9: .nf
        !            10: .SP 1v
        !            11: .ul 0
        !            12: ..
        !            13: .de TE
        !            14: .SP 1v
        !            15: .fi
        !            16: ..
        !            17: .\".de PT
        !            18: .\".if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''
        !            19: .\".if \\n%>1 'sp
        !            20: .\"..
        !            21: .ND July 21, 1975
        !            22: .\".RP
        !            23: .\".TM 75-1274-15 39199 39199-11
        !            24: .TL
        !            25: Lex \- A Lexical Analyzer ~Generator~
        !            26: .AU ``MH 2C-569'' 6377
        !            27: M. E. Lesk and E. Schmidt
        !            28: .AI
        !            29: .MH
        !            30: .AB
        !            31: .sp
        !            32: .bd I 2
        !            33: .\".nr PS 8
        !            34: .\".nr VS 9
        !            35: .\".ps 8
        !            36: .\".vs 9p
        !            37: Lex helps write programs whose control flow
        !            38: is directed by instances of regular
        !            39: expressions in the input stream.
        !            40: It is well suited for editor-script type transformations and
        !            41: for segmenting input in preparation for
        !            42: a parsing routine.
        !            43: .PP
        !            44: Lex source is a table of regular expressions and corresponding program fragments.
        !            45: The table is translated to a program
        !            46: which reads an input stream, copying it to an output stream
        !            47: and partitioning the input
        !            48: into strings which match the given expressions.
        !            49: As each such string is recognized the corresponding
        !            50: program fragment is executed.
        !            51: The recognition of the expressions
        !            52: is performed by a deterministic finite automaton
        !            53: generated by Lex.
        !            54: The program fragments written by the user are executed in the order in which the
        !            55: corresponding regular expressions occur in the input stream.
        !            56: .if n .if \n(tm .ig
        !            57: .PP
        !            58: The lexical analysis
        !            59: programs written with Lex accept ambiguous specifications
        !            60: and choose the longest
        !            61: match possible at each input point.
        !            62: If necessary, substantial look~ahead
        !            63: is performed on the input, but the
        !            64: input stream will be backed up to the
        !            65: end of the current partition, so that the user
        !            66: has general freedom to manipulate it.
        !            67: .PP
        !            68: Lex can generate analyzers in either C or Ratfor, a language
        !            69: which can be translated automatically to portable Fortran.
        !            70: It is available on the PDP-11 UNIX, Honeywell GCOS,
        !            71: and IBM OS systems.
        !            72: This manual, however, will only discuss generating analyzers
        !            73: in C on the UNIX system, which is the only supported
        !            74: form of Lex under UNIX Version 7.
        !            75: Lex is designed to simplify
        !            76: interfacing with Yacc, for those
        !            77: with access to this compiler-compiler system.
        !            78: ..
        !            79: .\".nr PS 9
        !            80: .\".nr VS 11
        !            81: .AE
        !            82: .2C
        !            83: .NH
        !            84: Introduction.
        !            85: .PP
        !            86: Lex is a program generator designed for
        !            87: lexical processing of character input streams.
        !            88: It accepts a high-level, problem oriented specification
        !            89: for character string matching,
        !            90: and
        !            91: produces a program in a general purpose language which recognizes
        !            92: regular expressions.
        !            93: The regular expressions are specified by the user in the
        !            94: source specifications given to Lex.
        !            95: The Lex written code recognizes these expressions
        !            96: in an input stream and partitions the input stream into
        !            97: strings matching the expressions.  At the bound~aries
        !            98: between strings
        !            99: program sections
        !           100: provided by the user are executed.
        !           101: The Lex source file associates the regular expressions and the
        !           102: program fragments.
        !           103: As each expression appears in the input to the program written by Lex,
        !           104: the corresponding fragment is executed.
        !           105: .PP
        !           106: .de MH
        !           107: Bell Laboratories, Murray Hill, NJ 07974.
        !           108: ..
        !           109: The user supplies the additional code
        !           110: beyond expression matching
        !           111: needed to complete his tasks, possibly
        !           112: including code written by other generators.
        !           113: The program that recognizes the expressions is generated in the
        !           114: general purpose programming language employed for the
        !           115: user's program fragments.
        !           116: Thus, a high level expression
        !           117: language is provided to write the string expressions to be
        !           118: matched while the user's freedom to write actions
        !           119: is unimpaired.
        !           120: This avoids forcing the user who wishes to use a string manipulation
        !           121: language for input analysis to write processing programs in the same
        !           122: and often inappropriate string handling language.
        !           123: .PP
        !           124: Lex is not a complete language, but rather a generator representing
        !           125: a new language feature which can be added to
        !           126: different programming languages, called ``host languages.'' 
        !           127: Just as general purpose languages
        !           128: can produce code to run on different computer hardware,
        !           129: Lex can write code in different host languages.
        !           130: The host language is used for the output code generated by Lex
        !           131: and also for the program fragments added by the user.
        !           132: Compatible run-time libraries for the different host languages
        !           133: are also provided.
        !           134: This makes Lex adaptable to different environments and
        !           135: different users.
        !           136: Each application
        !           137: may be directed to the combination of hardware and host language appropriate
        !           138: to the task, the user's background, and the properties of local
        !           139: implementations.
        !           140: At present, the only supported host language is C,
        !           141: although Fortran (in the form of Ratfor [2] has been available
        !           142: in the past.
        !           143: Lex itself exists on UNIX, GCOS, and OS/370; but the
        !           144: code generated by Lex may be taken anywhere the appropriate
        !           145: compilers exist.
        !           146: .PP
        !           147: Lex turns the user's expressions and actions
        !           148: (called
        !           149: .ul
        !           150: source
        !           151: in this memo) into the host general-purpose language;
        !           152: the generated program is named
        !           153: .ul
        !           154: yylex.
        !           155: The
        !           156: .ul
        !           157: yylex
        !           158: program
        !           159: will recognize expressions
        !           160: in a stream
        !           161: (called
        !           162: .ul
        !           163: input
        !           164: in this memo)
        !           165: and perform the specified actions for each expression as it is detected.
        !           166: See Figure 1.
        !           167: .GS
        !           168: .TS
        !           169: center;
        !           170: l _ r
        !           171: l|c|r
        !           172: l _ r
        !           173: l _ r
        !           174: l|c|r
        !           175: l _ r
        !           176: c s s
        !           177: c s s.
        !           178: 
        !           179: Source \(->    Lex     \(-> yylex
        !           180: 
        !           181: .sp 2
        !           182: 
        !           183: Input \(->     yylex   \(-> Output
        !           184: 
        !           185: .sp
        !           186: An overview of Lex
        !           187: Figure 1
        !           188: .TE
        !           189: .GE
        !           190: .PP
        !           191: For a trivial example, consider a program to delete
        !           192: from the input
        !           193: all blanks or tabs at the ends of lines.
        !           194: .TS
        !           195: center;
        !           196: l l.
        !           197: %%
        !           198: [ \et]+$       ;
        !           199: .TE
        !           200: is all that is required.
        !           201: The program
        !           202: contains a %% delimiter to mark the beginning of the rules, and
        !           203: one rule.
        !           204: This rule contains a regular expression
        !           205: which matches one or more
        !           206: instances of the characters blank or tab
        !           207: (written \et for visibility, in accordance with the C language convention)
        !           208: just prior to the end of a line.
        !           209: The brackets indicate the character
        !           210: class made of blank and tab; the + indicates ``one or more ...'';
        !           211: and the $ indicates ``end of line,'' as in QED.
        !           212: No action is specified,
        !           213: so the program generated by Lex (yylex) will ignore these characters.
        !           214: Everything else will be copied.
        !           215: To change any remaining
        !           216: string of blanks or tabs to a single blank,
        !           217: add another rule:
        !           218: .TS
        !           219: center;
        !           220: l l.
        !           221: %%
        !           222: [ \et]+$       ;
        !           223: [ \et]+        printf(" ");
        !           224: .TE
        !           225: The finite automaton generated for this
        !           226: source will scan for both rules at once,
        !           227: observing at
        !           228: the termination of the string of blanks or tabs
        !           229: whether or not there is a newline character, and executing
        !           230: the desired rule action.
        !           231: The first rule matches all strings of blanks or tabs
        !           232: at the end of lines, and the second
        !           233: rule all remaining strings of blanks or tabs.
        !           234: .PP
        !           235: Lex can be used alone for simple transformations, or
        !           236: for analysis and statistics gathering on a lexical level.
        !           237: Lex can also be used with a parser generator
        !           238: to perform the lexical analysis phase; it is particularly
        !           239: easy to interface Lex and Yacc [3].
        !           240: Lex programs recognize only regular expressions;
        !           241: Yacc writes parsers that accept a large class of context free grammars,
        !           242: but require a lower level analyzer to recognize input tokens.
        !           243: Thus, a combination of Lex and Yacc is often appropriate.
        !           244: When used as a preprocessor for a later parser generator,
        !           245: Lex is used to partition the input stream,
        !           246: and the parser generator assigns structure to
        !           247: the resulting pieces.
        !           248: The flow of control
        !           249: in such a case (which might be the first half of a compiler,
        !           250: for example) is shown in Figure 2.
        !           251: Additional programs,
        !           252: written by other generators
        !           253: or by hand, can
        !           254: be added easily to programs written by Lex.
        !           255: .BS 2
        !           256: .ps 9
        !           257: .vs 11
        !           258: .TS
        !           259: center;
        !           260: l c c c l
        !           261: l c c c l
        !           262: l c c c l
        !           263: l _ c _ l
        !           264: l|c|c|c|l
        !           265: l _ c _ l
        !           266: l c c c l
        !           267: l _ c _ l
        !           268: l|c|c|c|l
        !           269: l _ c _ l
        !           270: l c s s l
        !           271: l c s s l.
        !           272:        lexical         grammar
        !           273:        rules           rules
        !           274:        \(da            \(da
        !           275: 
        !           276:        Lex             Yacc
        !           277: 
        !           278:        \(da            \(da
        !           279: 
        !           280: Input \(->     yylex   \(->    yyparse \(-> Parsed input
        !           281: 
        !           282: .sp
        !           283:        Lex with Yacc
        !           284:        Figure 2
        !           285: .TE
        !           286: .ps 10
        !           287: .vs 12
        !           288: .BE
        !           289: Yacc users
        !           290: will realize that the name
        !           291: .ul
        !           292: yylex
        !           293: is what Yacc expects its lexical analyzer to be named,
        !           294: so that the use of this name by Lex simplifies
        !           295: interfacing.
        !           296: .PP
        !           297: Lex generates a deterministic finite automaton from the regular expressions
        !           298: in the source [4].
        !           299: The automaton is interpreted, rather than compiled, in order
        !           300: to save space.
        !           301: The result is still a fast analyzer.
        !           302: In particular, the time taken by a Lex program
        !           303: to recognize and partition an input stream is
        !           304: proportional to the length of the input.
        !           305: The number of Lex rules or
        !           306: the complexity of the rules is
        !           307: not important in determining speed,
        !           308: unless rules which include
        !           309: forward context require a significant amount of re~scanning.
        !           310: What does increase with the number and complexity of rules
        !           311: is the size of the finite
        !           312: automaton, and therefore the size of the program
        !           313: generated by Lex.
        !           314: .PP
        !           315: In the program written by Lex, the user's fragments
        !           316: (representing the
        !           317: .ul
        !           318: actions
        !           319: to be performed as each regular expression
        !           320: is found)
        !           321: are gathered
        !           322: as cases of a switch.
        !           323: The automaton interpreter directs the control flow.
        !           324: Opportunity is provided for the user to insert either
        !           325: declarations or additional statements in the routine containing
        !           326: the actions, or to
        !           327: add subroutines outside this action routine.
        !           328: .PP
        !           329: Lex is not limited to source which can
        !           330: be interpreted on the basis of one character
        !           331: look~ahead.
        !           332: For example,
        !           333: if there are two rules, one looking for
        !           334: .I ab
        !           335: and another for
        !           336: .I abcdefg ,
        !           337: and the input stream is
        !           338: .I abcdefh ,
        !           339: Lex will recognize
        !           340: .I ab
        !           341: and leave
        !           342: the input pointer just before
        !           343: .I "cd. . ."
        !           344: Such backup is more costly
        !           345: than the processing of simpler languages.
        !           346: .2C
        !           347: .NH
        !           348: Lex Source.
        !           349: .PP
        !           350: The general format of Lex source is:
        !           351: .TS
        !           352: center;
        !           353: l.
        !           354: {definitions}
        !           355: %%
        !           356: {rules}
        !           357: %%
        !           358: {user subroutines}
        !           359: .TE
        !           360: where the definitions and the user subroutines
        !           361: are often omitted.
        !           362: The second
        !           363: .I %%
        !           364: is optional, but the first is required
        !           365: to mark the beginning of the rules.
        !           366: The absolute minimum Lex program is thus
        !           367: .TS
        !           368: center;
        !           369: l.
        !           370: %%
        !           371: .TE
        !           372: (no definitions, no rules) which translates into a program
        !           373: which copies the input to the output unchanged.
        !           374: .PP
        !           375: In the outline of Lex programs shown above, the
        !           376: .I
        !           377: rules
        !           378: .R
        !           379: represent the user's control
        !           380: decisions; they are a table, in which the left column
        !           381: contains
        !           382: .I
        !           383: regular expressions
        !           384: .R
        !           385: (see section 3)
        !           386: and the right column contains
        !           387: .I
        !           388: actions,
        !           389: .R
        !           390: program fragments to be executed when the expressions
        !           391: are recognized.
        !           392: Thus an individual rule might appear
        !           393: .TS
        !           394: center;
        !           395: l l.
        !           396: integer        printf("found keyword INT");
        !           397: .TE
        !           398: to look for the string
        !           399: .I integer
        !           400: in the input stream and
        !           401: print the message ``found keyword INT'' whenever it appears.
        !           402: In this example the host procedural language is C and
        !           403: the C library function
        !           404: .I
        !           405: printf
        !           406: .R
        !           407: is used to print the string.
        !           408: The end
        !           409: of the expression is indicated by the first blank or tab character.
        !           410: If the action is merely a single C expression,
        !           411: it can just be given on the right side of the line; if it is
        !           412: compound, or takes more than a line, it should be enclosed in
        !           413: braces.
        !           414: As a slightly more useful example, suppose it is desired to
        !           415: change a number of words from British to American spelling.
        !           416: Lex rules such as
        !           417: .TS
        !           418: center;
        !           419: l l.
        !           420: colour printf("color");
        !           421: mechanise      printf("mechanize");
        !           422: petrol printf("gas");
        !           423: .TE
        !           424: would be a start.  These rules are not quite enough,
        !           425: since
        !           426: the word
        !           427: .I petroleum
        !           428: would become
        !           429: .I gaseum ;
        !           430: a way of dealing
        !           431: with this will be described later.
        !           432: .2C
        !           433: .NH
        !           434: Lex Regular Expressions.
        !           435: .PP
        !           436: The definitions of regular expressions are very similar to those
        !           437: in QED [5].
        !           438: A regular
        !           439: expression specifies a set of strings to be matched.
        !           440: It contains text characters (which match the corresponding
        !           441: characters in the strings being compared)
        !           442: and operator characters (which specify
        !           443: repetitions, choices, and other features).
        !           444: The letters of the alphabet and the digits are
        !           445: always text characters; thus the regular expression
        !           446: .TS
        !           447: center;
        !           448: l l.
        !           449: integer
        !           450: .TE
        !           451: matches the string
        !           452: .ul
        !           453: integer
        !           454: wherever it appears
        !           455: and the expression
        !           456: .TS
        !           457: center;
        !           458: l.
        !           459: a57D
        !           460: .TE
        !           461: looks for the string
        !           462: .ul
        !           463: a57D.
        !           464: .PP
        !           465: .I
        !           466: Operators.
        !           467: .R
        !           468: The operator characters are
        !           469: .TS
        !           470: center;
        !           471: l.
        !           472: " \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < >
        !           473: .TE
        !           474: and if they are to be used as text characters, an escape
        !           475: should be used.
        !           476: The quotation mark operator (")
        !           477: indicates that whatever is contained between a pair of quotes
        !           478: is to be taken as text characters.
        !           479: Thus
        !           480: .TS
        !           481: center;
        !           482: l.
        !           483: xyz"++"
        !           484: .TE
        !           485: matches the string
        !           486: .I xyz++
        !           487: when it appears.  Note that a part of a string may be quoted.
        !           488: It is harmless but unnecessary to quote an ordinary
        !           489: text character; the expression
        !           490: .TS
        !           491: center;
        !           492: l.
        !           493: "xyz++"
        !           494: .TE
        !           495: is the same as the one above.
        !           496: Thus by quoting every non-alphanumeric character
        !           497: being used as a text character, the user can avoid remembering
        !           498: the list above of current
        !           499: operator characters, and is safe should further extensions to Lex
        !           500: lengthen the list.
        !           501: .PP
        !           502: An operator character may also be turned into a text character
        !           503: by preceding it with \e as in
        !           504: .TS
        !           505: center;
        !           506: l.
        !           507: xyz\e+\e+
        !           508: .TE
        !           509: which
        !           510: is another, less readable, equivalent of the above expressions.
        !           511: Another use of the quoting mechanism is to get a blank into
        !           512: an expression; normally, as explained above, blanks or tabs end
        !           513: a rule.
        !           514: Any blank character not contained within [\|] (see below) must
        !           515: be quoted.
        !           516: Several normal C escapes with \e
        !           517: are recognized: \en is newline, \et is tab, and \eb is backspace.
        !           518: To enter \e itself, use \e\e.
        !           519: Since newline is illegal in an expression, \en must be used;
        !           520: it is not
        !           521: required to escape tab and backspace.
        !           522: Every character but blank, tab, newline and the list above is always
        !           523: a text character.
        !           524: .PP
        !           525: .I
        !           526: Character classes.
        !           527: .R
        !           528: Classes of characters can be specified using the operator pair [\|].
        !           529: The construction
        !           530: .I [abc]
        !           531: matches a
        !           532: single character, which may be
        !           533: .I a ,
        !           534: .I b ,
        !           535: or
        !           536: .I c .
        !           537: Within square brackets,
        !           538: most operator meanings are ignored.
        !           539: Only three characters are special:
        !           540: these are \e \(mi and ^.  The \(mi character
        !           541: indicates ranges.  For example,
        !           542: .TS
        !           543: center;
        !           544: l.
        !           545: [a\(miz0\(mi9<>_]
        !           546: .TE
        !           547: indicates the character class containing all the lower case letters,
        !           548: the digits,
        !           549: the angle brackets, and underline.
        !           550: Ranges may be given in either order.
        !           551: Using \(mi between any pair of characters which are
        !           552: not both upper case letters, both lower case letters, or both digits
        !           553: is implementation dependent and will get a warning message.
        !           554: (E.g., [0\-z] in ASCII is many more characters
        !           555: than it is in EBCDIC).
        !           556: If it is desired to include the
        !           557: character \(mi in a character class, it should be first or
        !           558: last; thus
        !           559: .TS
        !           560: center;
        !           561: l.
        !           562: [\(mi+0\(mi9]
        !           563: .TE
        !           564: matches all the digits and the two signs.
        !           565: .PP
        !           566: In character classes,
        !           567: the ^ operator must appear as the first character
        !           568: after the left bracket; it indicates that the resulting string
        !           569: is to be complemented with respect to the computer character set.
        !           570: Thus
        !           571: .TS
        !           572: center;
        !           573: l.
        !           574: [^abc]
        !           575: .TE
        !           576: matches all characters except a, b, or c, including
        !           577: all special or control characters; or
        !           578: .TS
        !           579: center;
        !           580: l.
        !           581: [^a\-zA\-Z]
        !           582: .TE
        !           583: is any character which is not a letter.
        !           584: The \e character provides the usual escapes within
        !           585: character class brackets.
        !           586: .PP
        !           587: .I
        !           588: Arbitrary character.
        !           589: .R
        !           590: To match almost any character, the operator character
        !           591: .TS
        !           592: center;
        !           593: l.
        !           594: \&.
        !           595: .TE
        !           596: is the class of all characters except newline.
        !           597: Escaping into octal is possible although non-portable:
        !           598: .TS
        !           599: center;
        !           600: l.
        !           601: [\e40\-\e176]
        !           602: .TE
        !           603: matches all printable characters in the ASCII character set, from octal
        !           604: 40 (blank) to octal 176 (tilde).
        !           605: .PP
        !           606: .I
        !           607: Optional expressions.
        !           608: .R
        !           609: The operator
        !           610: .I ?
        !           611: indicates
        !           612: an optional element of an expression.
        !           613: Thus
        !           614: .TS
        !           615: center;
        !           616: l.
        !           617: ab?c
        !           618: .TE
        !           619: matches either
        !           620: .I ac
        !           621: or
        !           622: .I abc .
        !           623: .PP
        !           624: .I
        !           625: Repeated expressions.
        !           626: .R
        !           627: Repetitions of classes are indicated by the operators
        !           628: .I \(**
        !           629: and
        !           630: .I + .
        !           631: .TS
        !           632: center;
        !           633: l.
        !           634: \f2a\(**\f1
        !           635: .TE
        !           636: is any number of consecutive
        !           637: .I a
        !           638: characters, including zero; while
        !           639: .TS
        !           640: center;
        !           641: l.
        !           642: a+
        !           643: .TE
        !           644: is one or more instances of
        !           645: .I a.
        !           646: For example,
        !           647: .TS
        !           648: center;
        !           649: l.
        !           650: [a\-z]+
        !           651: .TE
        !           652: is all strings of lower case letters.
        !           653: And
        !           654: .TS
        !           655: center;
        !           656: l.
        !           657: [A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(**
        !           658: .TE
        !           659: indicates all alphanumeric strings with a leading
        !           660: alphabetic character.
        !           661: This is a typical expression for recognizing identifiers in
        !           662: computer languages.
        !           663: .PP
        !           664: .I
        !           665: Alternation and Grouping.
        !           666: .R
        !           667: The operator |
        !           668: indicates alternation:
        !           669: .TS
        !           670: center;
        !           671: l.
        !           672: (ab\||\|cd)
        !           673: .TE
        !           674: matches either
        !           675: .ul
        !           676: ab
        !           677: or
        !           678: .ul
        !           679: cd.
        !           680: Note that parentheses are used for grouping, although
        !           681: they are
        !           682: not necessary on the outside level;
        !           683: .TS
        !           684: center;
        !           685: l.
        !           686: ab\||\|cd
        !           687: .TE
        !           688: would have sufficed.
        !           689: Parentheses
        !           690: can be used for more complex expressions:
        !           691: .TS
        !           692: center;
        !           693: l.
        !           694: (ab\||\|cd+)?(ef)\(**
        !           695: .TE
        !           696: matches such strings as
        !           697: .I abefef ,
        !           698: .I efefef ,
        !           699: .I cdef ,
        !           700: or
        !           701: .I cddd\| ;
        !           702: but not
        !           703: .I abc ,
        !           704: .I abcd ,
        !           705: or
        !           706: .I abcdef .
        !           707: .PP
        !           708: .I
        !           709: Context sensitivity.
        !           710: .R
        !           711: Lex will recognize a small amount of surrounding
        !           712: context.  The two simplest operators for this are
        !           713: .I ^
        !           714: and
        !           715: .I $ .
        !           716: If the first character of an expression is
        !           717: .I ^ ,
        !           718: the expression will only be matched at the beginning
        !           719: of a line (after a newline character, or at the beginning of
        !           720: the input stream).
        !           721: This can never conflict with the other meaning of
        !           722: .I ^ ,
        !           723: complementation
        !           724: of character classes, since that only applies within
        !           725: the [\|] operators.
        !           726: If the very last character is
        !           727: .I $ ,
        !           728: the expression will only be matched at the end of a line (when
        !           729: immediately followed by newline).
        !           730: The latter operator is a special case of the
        !           731: .I /
        !           732: operator character,
        !           733: which indicates trailing context.
        !           734: The expression
        !           735: .TS
        !           736: center;
        !           737: l.
        !           738: ab/cd
        !           739: .TE
        !           740: matches the string
        !           741: .I ab ,
        !           742: but only if followed by
        !           743: .ul
        !           744: cd.
        !           745: Thus
        !           746: .TS
        !           747: center;
        !           748: l.
        !           749: ab$
        !           750: .TE
        !           751: is the same as
        !           752: .TS
        !           753: center;
        !           754: l.
        !           755: ab/\en
        !           756: .TE
        !           757: Left context is handled in Lex by
        !           758: .I
        !           759: start conditions
        !           760: .R
        !           761: as explained in section 10.  If a rule is only to be executed
        !           762: when the Lex automaton interpreter is in start condition
        !           763: .I
        !           764: x,
        !           765: .R
        !           766: the rule should be prefixed by
        !           767: .TS
        !           768: center;
        !           769: l.
        !           770: <x>
        !           771: .TE
        !           772: using the angle bracket operator characters.
        !           773: If we considered ``being at the beginning of a line'' to be
        !           774: start condition
        !           775: .I ONE ,
        !           776: then the ^ operator
        !           777: would be equivalent to
        !           778: .TS
        !           779: center;
        !           780: l.
        !           781: <ONE>
        !           782: .TE
        !           783: Start conditions are explained more fully later.
        !           784: .PP
        !           785: .I
        !           786: Repetitions and Definitions.
        !           787: .R
        !           788: The operators {} specify
        !           789: either repetitions (if they enclose numbers)
        !           790: or
        !           791: definition expansion (if they enclose a name).  For example
        !           792: .TS
        !           793: center;
        !           794: l.
        !           795: {digit}
        !           796: .TE
        !           797: looks for a predefined string named
        !           798: .I digit
        !           799: and inserts it
        !           800: at that point in the expression.
        !           801: The definitions are given in the first part of the Lex
        !           802: input, before the rules.
        !           803: In contrast,
        !           804: .TS
        !           805: center;
        !           806: l.
        !           807: a{1,5}
        !           808: .TE
        !           809: looks for 1 to 5 occurrences of
        !           810: .I a .
        !           811: .PP
        !           812: Finally, initial
        !           813: .I %
        !           814: is special, being the separator
        !           815: for Lex source segments.
        !           816: .2C
        !           817: .NH
        !           818: Lex Actions.
        !           819: .PP
        !           820: When an expression written as above is matched, Lex
        !           821: executes the corresponding action.  This section describes
        !           822: some features of Lex which aid in writing actions.  Note
        !           823: that there is a default action, which
        !           824: consists of copying the input to the output.  This
        !           825: is performed on all strings not otherwise matched.  Thus
        !           826: the Lex user who wishes to absorb the entire input, without
        !           827: producing any output, must provide rules to match everything.
        !           828: When Lex is being used with Yacc, this is the normal
        !           829: situation.
        !           830: One may consider that actions are what is done instead of
        !           831: copying the input to the output; thus, in general,
        !           832: a rule which merely copies can be omitted.
        !           833: Also, a character combination
        !           834: which is omitted from the rules
        !           835: and which appears as input
        !           836: is likely to be printed on the output, thus calling
        !           837: attention to the gap in the rules.
        !           838: .PP
        !           839: One of the simplest things that can be done is to ignore
        !           840: the input.   Specifying a C null statement, \fI;\fR as an action
        !           841: causes this result.  A frequent rule is
        !           842: .TS
        !           843: center;
        !           844: l l.
        !           845: [ \et\en]      ;
        !           846: .TE
        !           847: which causes the three spacing characters (blank, tab, and newline)
        !           848: to be ignored.
        !           849: .PP
        !           850: Another easy way to avoid writing actions is the action character
        !           851: |, which indicates that the action for this rule is the action
        !           852: for the next rule.
        !           853: The previous example could also have been written
        !           854: .TS
        !           855: center;
        !           856: l l.
        !           857: " "            |
        !           858: "\et"          |
        !           859: "\en"          ;
        !           860: .TE
        !           861: with the same result, although in different style.
        !           862: The quotes around \en and \et are not required.
        !           863: .PP
        !           864: In more complex actions, the user
        !           865: will
        !           866: often want to know the actual text that matched some expression
        !           867: like
        !           868: .I [a\(miz]+ .
        !           869: Lex leaves this text in an external character
        !           870: array named
        !           871: .I
        !           872: yytext.
        !           873: .R
        !           874: Thus, to print the name found,
        !           875: a rule like
        !           876: .TS
        !           877: center;
        !           878: l l.
        !           879: [a\-z]+        printf("%s", yytext);
        !           880: .TE
        !           881: will print
        !           882: the string in
        !           883: .I
        !           884: yytext.
        !           885: .R
        !           886: The C function
        !           887: .I
        !           888: printf
        !           889: .R
        !           890: accepts a format argument and data to be printed;
        !           891: in this case, the format is ``print string'' (% indicating
        !           892: data conversion, and
        !           893: .I s
        !           894: indicating string type),
        !           895: and the data are the characters
        !           896: in
        !           897: .I
        !           898: yytext.
        !           899: .R
        !           900: So this just places
        !           901: the matched string
        !           902: on the output.
        !           903: This action
        !           904: is so common that
        !           905: it may be written as ECHO:
        !           906: .TS
        !           907: center;
        !           908: l l.
        !           909: [a\-z]+        ECHO;
        !           910: .TE
        !           911: is the same as the above.
        !           912: Since the default action is just to
        !           913: print the characters found, one might ask why
        !           914: give a rule, like this one, which merely specifies
        !           915: the default action?
        !           916: Such rules are often required
        !           917: to avoid matching some other rule
        !           918: which is not desired.  For example, if there is a rule
        !           919: which matches
        !           920: .I read
        !           921: it will normally match the instances of
        !           922: .I read
        !           923: contained in
        !           924: .I bread
        !           925: or
        !           926: .I readjust ;
        !           927: to avoid
        !           928: this,
        !           929: a rule
        !           930: of the form
        !           931: .I [a\(miz]+
        !           932: is needed.
        !           933: This is explained further below.
        !           934: .PP
        !           935: Sometimes it is more convenient to know the end of what
        !           936: has been found; hence Lex also provides a count
        !           937: .I
        !           938: yyleng
        !           939: .R
        !           940: of the number of characters matched.
        !           941: To count both the number
        !           942: of words and the number of characters in words in the input, the user might write
        !           943: .TS
        !           944: center;
        !           945: l l.
        !           946: [a\-zA\-Z]+    {words++; chars += yyleng;}
        !           947: .TE
        !           948: which accumulates in
        !           949: .ul
        !           950: chars
        !           951: the number
        !           952: of characters in the words recognized.
        !           953: The last character in the string matched can
        !           954: be accessed by
        !           955: .TS
        !           956: center;
        !           957: l.
        !           958: yytext[yyleng\-1]
        !           959: .TE
        !           960: .PP
        !           961: Occasionally, a Lex
        !           962: action may decide that a rule has not recognized the correct
        !           963: span of characters.
        !           964: Two routines are provided to aid with this situation.
        !           965: First,
        !           966: .I
        !           967: yymore()
        !           968: .R
        !           969: can be called to indicate that the next input expression recognized is to be
        !           970: tacked on to the end of this input.  Normally,
        !           971: the next input string would overwrite the current
        !           972: entry in
        !           973: .I
        !           974: yytext.
        !           975: .R
        !           976: Second,
        !           977: .I
        !           978: yyless (n)
        !           979: .R
        !           980: may be called to indicate that not all the characters matched
        !           981: by the currently successful expression are wanted right now.
        !           982: The argument
        !           983: .I
        !           984: n
        !           985: .R
        !           986: indicates the number of characters
        !           987: in
        !           988: .I
        !           989: yytext
        !           990: .R
        !           991: to be retained.
        !           992: Further characters previously matched
        !           993: are
        !           994: returned to the input.  This provides the same sort of
        !           995: look~ahead offered by the / operator,
        !           996: but in a different form.
        !           997: .PP
        !           998: .I
        !           999: Example:
        !          1000: .R
        !          1001: Consider a language which defines
        !          1002: a string as a set of characters between quotation (") marks, and provides that
        !          1003: to include a " in a string it must be preceded by a \e.  The
        !          1004: regular expression which matches that is somewhat confusing,
        !          1005: so that it might be preferable to write
        !          1006: .TS
        !          1007: center;
        !          1008: l l.
        !          1009: \e"[^"]\(**    {
        !          1010:        if (yytext[yyleng\-1] == \(fm\e\e\(fm)
        !          1011:             yymore();
        !          1012:        else
        !          1013:             ... normal user processing
        !          1014:        }
        !          1015: .TE
        !          1016: which will, when faced with a string such as
        !          1017: .I
        !          1018: "abc\e"def\|"
        !          1019: .R
        !          1020: first match
        !          1021: the five characters
        !          1022: \fI"abc\e\|\fR;
        !          1023: then
        !          1024: the call to
        !          1025: .I yymore()
        !          1026: will
        !          1027: cause the next part of the string,
        !          1028: \fI"def\|\fR,
        !          1029: to be tacked on the end.
        !          1030: Note that the final quote terminating the string should be picked
        !          1031: up in the code labeled ``normal processing''.
        !          1032: .PP
        !          1033: The function
        !          1034: .I
        !          1035: yyless()
        !          1036: .R
        !          1037: might be used to reprocess
        !          1038: text in various circumstances.  Consider the C problem of distinguishing
        !          1039: the ambiguity of ``=\(mia''.
        !          1040: Suppose it is desired to treat this as ``=\(mi a''
        !          1041: but print a message.  A rule might be
        !          1042: .ps 9
        !          1043: .vs 11
        !          1044: .TS
        !          1045: center;
        !          1046: l l.
        !          1047: =\(mi[a\-zA\-Z]        {
        !          1048:        printf("Op (=\(mi) ambiguous\en");
        !          1049:        yyless(yyleng\-1);
        !          1050:        ... action for =\(mi ...
        !          1051:        }
        !          1052: .TE
        !          1053: .ps 10
        !          1054: .vs 12
        !          1055: which prints a message, returns the letter after the
        !          1056: operator to the input stream, and treats the operator as ``=\(mi''.
        !          1057: Alternatively it might be desired to treat this as ``=  \(mia''.
        !          1058: To do this, just return the minus
        !          1059: sign as well as the letter to the input:
        !          1060: .ps 9
        !          1061: .vs 11
        !          1062: .TS
        !          1063: center;
        !          1064: l l.
        !          1065: =\(mi[a\-zA\-Z]        {
        !          1066:        printf("Op (=\(mi) ambiguous\en");
        !          1067:        yyless(yyleng\-2);
        !          1068:        ... action for = ...
        !          1069:        }
        !          1070: .TE
        !          1071: .ps 10
        !          1072: .vs 12
        !          1073: will perform the other interpretation.
        !          1074: Note that the expressions for the two cases might more easily
        !          1075: be written
        !          1076: .TS
        !          1077: center;
        !          1078: l l.
        !          1079: =\(mi/[A\-Za\-z]
        !          1080: .TE
        !          1081: in the first case and
        !          1082: .TS
        !          1083: center;
        !          1084: l.
        !          1085: =/\-[A\-Za\-z]
        !          1086: .TE
        !          1087: in the second;
        !          1088: no backup would be required in the rule action.
        !          1089: It is not necessary to recognize the whole identifier
        !          1090: to observe the ambiguity.
        !          1091: The
        !          1092: possibility of ``=\(mi3'', however, makes
        !          1093: .TS
        !          1094: center;
        !          1095: l.
        !          1096: =\(mi/[^ \et\en]
        !          1097: .TE
        !          1098: a still better rule.
        !          1099: .PP
        !          1100: In addition to these routines, Lex also permits
        !          1101: access to the I/O routines
        !          1102: it uses.
        !          1103: They are:
        !          1104: .IP 1)
        !          1105: .I
        !          1106: input()
        !          1107: .R
        !          1108: which returns the next input character;
        !          1109: .IP 2)
        !          1110: .I
        !          1111: output(c)
        !          1112: .R
        !          1113: which writes the character
        !          1114: .I
        !          1115: c
        !          1116: .R
        !          1117: on the output; and
        !          1118: .IP 3)
        !          1119: .I
        !          1120: unput(c)
        !          1121: .R
        !          1122: pushes the character
        !          1123: .I
        !          1124: c
        !          1125: .R
        !          1126: back onto the input stream to be read later by
        !          1127: .I
        !          1128: input().
        !          1129: .R
        !          1130: .LP
        !          1131: By default these routines are provided as macro definitions,
        !          1132: but the user can override them and supply private versions.
        !          1133: These routines
        !          1134: define the relationship between external files and
        !          1135: internal characters, and must all be retained
        !          1136: or modified consistently.
        !          1137: They may be redefined, to
        !          1138: cause input or output to be transmitted to or from strange
        !          1139: places, including other programs or internal memory;
        !          1140: but the character set used must be consistent in all routines;
        !          1141: a value of zero returned by
        !          1142: .I
        !          1143: input
        !          1144: .R
        !          1145: must mean end of file; and
        !          1146: the relationship between
        !          1147: .I
        !          1148: unput
        !          1149: .R
        !          1150: and
        !          1151: .I
        !          1152: input
        !          1153: .R
        !          1154: must be retained
        !          1155: or the Lex look~ahead will not work.
        !          1156: Lex does not look ahead at all if it does not have to,
        !          1157: but every rule ending in
        !          1158: .ft I
        !          1159: + \(** ?
        !          1160: .ft R
        !          1161: or
        !          1162: .ft I
        !          1163: $
        !          1164: .ft R
        !          1165: or containing
        !          1166: .ft I
        !          1167: /
        !          1168: .ft R
        !          1169: implies look~ahead.
        !          1170: Look~ahead is also necessary to match an expression that is a prefix
        !          1171: of another expression.
        !          1172: See below for a discussion of the character set used by Lex.
        !          1173: The standard Lex library imposes
        !          1174: a 100 character limit on backup.
        !          1175: .PP
        !          1176: Another Lex library routine that the user will sometimes want
        !          1177: to redefine is
        !          1178: .I
        !          1179: yywrap()
        !          1180: .R
        !          1181: which is called whenever Lex reaches an end-of-file.
        !          1182: If
        !          1183: .I
        !          1184: yywrap
        !          1185: .R
        !          1186: returns a 1, Lex continues with the normal wrapup on end of input.
        !          1187: Sometimes, however, it is convenient to arrange for more
        !          1188: input to arrive
        !          1189: from a new source.
        !          1190: In this case, the user should provide
        !          1191: a
        !          1192: .I
        !          1193: yywrap
        !          1194: .R
        !          1195: which
        !          1196: arranges for new input and
        !          1197: returns 0.  This instructs Lex to continue processing.
        !          1198: The default
        !          1199: .I
        !          1200: yywrap
        !          1201: .R
        !          1202: always returns 1.
        !          1203: .PP
        !          1204: This routine is also a convenient place
        !          1205: to print tables, summaries, etc. at the end
        !          1206: of a program.  Note that it is not
        !          1207: possible to write a normal rule which recognizes
        !          1208: end-of-file; the only access to this condition is
        !          1209: through
        !          1210: .I
        !          1211: yywrap.
        !          1212: .R
        !          1213: In fact, unless a private version of
        !          1214: .I
        !          1215: input()
        !          1216: .R
        !          1217: is supplied
        !          1218: a file containing nulls
        !          1219: cannot be handled,
        !          1220: since a value of 0 returned by
        !          1221: .I
        !          1222: input
        !          1223: .R
        !          1224: is taken to be end-of-file.
        !          1225: .PP
        !          1226: .2C
        !          1227: .NH
        !          1228: Ambiguous Source Rules.
        !          1229: .PP
        !          1230: Lex can handle ambiguous specifications.
        !          1231: When more than one expression can match the
        !          1232: current input, Lex chooses as follows:
        !          1233: .IP 1)
        !          1234: The longest match is preferred.
        !          1235: .IP 2)
        !          1236: Among rules which matched the same number of characters,
        !          1237: the rule given first is preferred.
        !          1238: .LP
        !          1239: Thus, suppose the rules
        !          1240: .TS
        !          1241: center;
        !          1242: l l.
        !          1243: integer        keyword action ...;
        !          1244: [a\-z]+        identifier action ...;
        !          1245: .TE
        !          1246: to be given in that order.  If the input is
        !          1247: .I integers ,
        !          1248: it is taken as an identifier, because
        !          1249: .I [a\-z]+
        !          1250: matches 8 characters while
        !          1251: .I integer
        !          1252: matches only 7.
        !          1253: If the input is
        !          1254: .I integer ,
        !          1255: both rules match 7 characters, and
        !          1256: the keyword rule is selected because it was given first.
        !          1257: Anything shorter (e.g. \fIint\fR\|) will
        !          1258: not match the expression
        !          1259: .I integer
        !          1260: and so the identifier interpretation is used.
        !          1261: .PP
        !          1262: The principle of preferring the longest
        !          1263: match makes rules containing
        !          1264: expressions like
        !          1265: .I \&.\(**
        !          1266: dangerous.
        !          1267: For example,
        !          1268: .TS
        !          1269: center;
        !          1270: l.
        !          1271: \&\(fm.\(**\(fm
        !          1272: .TE
        !          1273: might seem a good way of recognizing
        !          1274: a string in single quotes.
        !          1275: But it is an invitation for the program to read far
        !          1276: ahead, looking for a distant
        !          1277: single quote.
        !          1278: Presented with the input
        !          1279: .TS
        !          1280: center;
        !          1281: l l.
        !          1282: \&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here
        !          1283: .TE
        !          1284: the above expression will match
        !          1285: .TS
        !          1286: center;
        !          1287: l l.
        !          1288: \&\(fmfirst\(fm quoted string here, \(fmsecond\(fm
        !          1289: .TE
        !          1290: which is probably not what was wanted.
        !          1291: A better rule is of the form
        !          1292: .TS
        !          1293: center;
        !          1294: l.
        !          1295: \&\(fm[^\(fm\en]\(**\(fm
        !          1296: .TE
        !          1297: which, on the above input, will stop
        !          1298: after
        !          1299: .I \(fmfirst\(fm .
        !          1300: The consequences
        !          1301: of errors like this are mitigated by the fact
        !          1302: that the
        !          1303: .I \&.
        !          1304: operator will not match newline.
        !          1305: Thus expressions like
        !          1306: .I \&.\(**
        !          1307: stop on the
        !          1308: current line.
        !          1309: Don't try to defeat this with expressions like
        !          1310: .I (.|\en)+
        !          1311: or
        !          1312: equivalents;
        !          1313: the Lex generated program will try to read
        !          1314: the entire input file, causing
        !          1315: internal buffer overflows.
        !          1316: .PP
        !          1317: Note that Lex is normally partitioning
        !          1318: the input stream, not searching for all possible matches
        !          1319: of each expression.
        !          1320: This means that each character is accounted for
        !          1321: once and only once.
        !          1322: For example, suppose it is desired to
        !          1323: count occurrences of both \fIshe\fR and \fIhe\fR in an input text.
        !          1324: Some Lex rules to do this might be
        !          1325: .TS
        !          1326: center;
        !          1327: l l.
        !          1328: she    s++;
        !          1329: he     h++;
        !          1330: \en    |
        !          1331: \&.    ;
        !          1332: .TE
        !          1333: where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR.
        !          1334: Remember that . does not include newline.
        !          1335: Since \fIshe\fR includes \fIhe\fR, Lex will normally
        !          1336: .I
        !          1337: not
        !          1338: .R
        !          1339: recognize
        !          1340: the instances of \fIhe\fR included in \fIshe\fR,
        !          1341: since once it has passed a \fIshe\fR those characters are gone.
        !          1342: .PP
        !          1343: Sometimes the user would like to override this choice.  The action
        !          1344: REJECT
        !          1345: means ``go do the next alternative.''
        !          1346: It causes whatever rule was second choice after the current
        !          1347: rule to be executed.
        !          1348: The position of the input pointer is adjusted accordingly.
        !          1349: Suppose the user really wants to count the included instances of \fIhe\fR:
        !          1350: .TS
        !          1351: center;
        !          1352: l l.
        !          1353: she    {s++; REJECT;}
        !          1354: he     {h++; REJECT;}
        !          1355: \en    |
        !          1356: \&.    ;
        !          1357: .TE
        !          1358: these rules are one way of changing the previous example
        !          1359: to do just that.
        !          1360: After counting each expression, it is rejected; whenever appropriate,
        !          1361: the other expression will then be counted.  In this example, of course,
        !          1362: the user could note that \fIshe\fR includes \fIhe\fR but not
        !          1363: vice versa, and omit the REJECT action on \fIhe\fR;
        !          1364: in other cases, however, it
        !          1365: would not be possible a priori to tell
        !          1366: which input characters
        !          1367: were in both classes.
        !          1368: .PP
        !          1369: Consider the two rules
        !          1370: .TS
        !          1371: center;
        !          1372: l l.
        !          1373: a[bc]+ { ... ; REJECT;}
        !          1374: a[cd]+ { ... ; REJECT;}
        !          1375: .TE
        !          1376: If the input is
        !          1377: .I ab ,
        !          1378: only the first rule matches,
        !          1379: and on
        !          1380: .I ad
        !          1381: only the second matches.
        !          1382: The input string
        !          1383: .I accb
        !          1384: matches the first rule for four characters
        !          1385: and then the second rule for three characters.
        !          1386: In contrast, the input
        !          1387: .I accd
        !          1388: agrees with
        !          1389: the second rule for four characters and then the first
        !          1390: rule for three.
        !          1391: .PP
        !          1392: In general, REJECT is useful whenever
        !          1393: the purpose of Lex is not to partition the input
        !          1394: stream but to detect all examples of some items
        !          1395: in the input, and the instances of these items
        !          1396: may overlap or include each other.
        !          1397: Suppose a digram table of the input is desired;
        !          1398: normally the digrams overlap, that is the word
        !          1399: .I the
        !          1400: is considered to contain
        !          1401: both
        !          1402: .I th
        !          1403: and
        !          1404: .I he .
        !          1405: Assuming a two-dimensional array named
        !          1406: .ul
        !          1407: digram
        !          1408: to be incremented, the appropriate
        !          1409: source is
        !          1410: .TS
        !          1411: center;
        !          1412: l l.
        !          1413: %%
        !          1414: [a\-z][a\-z]   {
        !          1415:        digram[yytext[0]][yytext[1]]++;
        !          1416:        REJECT;
        !          1417:        }
        !          1418: \.     ;
        !          1419: \en    ;
        !          1420: .TE
        !          1421: where the REJECT is necessary to pick up
        !          1422: a letter pair beginning at every character, rather than at every
        !          1423: other character.
        !          1424: .2C
        !          1425: .NH
        !          1426: Lex Source Definitions.
        !          1427: .PP
        !          1428: Remember the format of the Lex
        !          1429: source:
        !          1430: .TS
        !          1431: center;
        !          1432: l.
        !          1433: {definitions}
        !          1434: %%
        !          1435: {rules}
        !          1436: %%
        !          1437: {user routines}
        !          1438: .TE
        !          1439: So far only the rules have been described.  The user needs
        !          1440: additional options,
        !          1441: though, to define variables for use in his program and for use
        !          1442: by Lex.
        !          1443: These can go either in the definitions section
        !          1444: or in the rules section.
        !          1445: .PP
        !          1446: Remember that Lex is turning the rules into a program.
        !          1447: Any source not intercepted by Lex is copied
        !          1448: into the generated program.  There are three classes
        !          1449: of such things.
        !          1450: .IP 1)
        !          1451: Any line which is not part of a Lex rule or action
        !          1452: which begins with a blank or tab is copied into
        !          1453: the Lex generated program.
        !          1454: Such source input prior to the first %% delimiter will be external
        !          1455: to any function in the code; if it appears immediately after the first
        !          1456: %%,
        !          1457: it appears in an appropriate place for declarations
        !          1458: in the function written by Lex which contains the actions.
        !          1459: This material must look like program fragments,
        !          1460: and should precede the first Lex rule.
        !          1461: .IP
        !          1462: As a side effect of the above, lines which begin with a blank
        !          1463: or tab, and which contain a comment,
        !          1464: are passed through to the generated program.
        !          1465: This can be used to include comments in either the Lex source or
        !          1466: the generated code.  The comments should follow the host
        !          1467: language convention.
        !          1468: .IP 2)
        !          1469: Anything included between lines containing
        !          1470: only
        !          1471: .I %{
        !          1472: and
        !          1473: .I %}
        !          1474: is
        !          1475: copied out as above.  The delimiters are discarded.
        !          1476: This format permits entering text like preprocessor statements that
        !          1477: must begin in column 1,
        !          1478: or copying lines that do not look like programs.
        !          1479: .IP 3)
        !          1480: Anything after the third %% delimiter, regardless of formats, etc.,
        !          1481: is copied out after the Lex output.
        !          1482: .PP
        !          1483: Definitions intended for Lex are given
        !          1484: before the first %% delimiter.  Any line in this section
        !          1485: not contained between %{ and %}, and begining
        !          1486: in column 1, is assumed to define Lex substitution strings.
        !          1487: The format of such lines is
        !          1488: .TS
        !          1489: center;
        !          1490: l l.
        !          1491: name translation
        !          1492: .TE
        !          1493: and it
        !          1494: causes the string given as a translation to
        !          1495: be associated with the name.
        !          1496: The name and translation
        !          1497: must be separated by at least one blank or tab, and the name must begin with a letter.
        !          1498: The translation can then be called out
        !          1499: by the {name} syntax in a rule.
        !          1500: Using {D} for the digits and {E} for an exponent field,
        !          1501: for example, might abbreviate rules to recognize numbers:
        !          1502: .TS
        !          1503: center;
        !          1504: l l.
        !          1505: D      [0\-9]
        !          1506: E      [DEde][\-+]?{D}+
        !          1507: %%
        !          1508: {D}+   printf("integer");
        !          1509: {D}+"."{D}\(**({E})?   |
        !          1510: {D}\(**"."{D}+({E})?   |
        !          1511: {D}+{E}                printf("real");
        !          1512: .TE
        !          1513: Note the first two rules for real numbers;
        !          1514: both require a decimal point and contain
        !          1515: an optional exponent field,
        !          1516: but the first requires at least one digit before the
        !          1517: decimal point and the second requires at least one
        !          1518: digit after the decimal point.
        !          1519: To correctly handle the problem
        !          1520: posed by a Fortran expression such as
        !          1521: .I 35.EQ.I ,
        !          1522: which does not contain a real number, a context-sensitive
        !          1523: rule such as
        !          1524: .TS
        !          1525: center;
        !          1526: l l.
        !          1527: [0\-9]+/"."EQ  printf("integer");
        !          1528: .TE
        !          1529: could be used in addition to the normal rule for integers.
        !          1530: .PP
        !          1531: The definitions
        !          1532: section may also contain other commands, including the
        !          1533: selection of a host language, a character set table,
        !          1534: a list of start conditions, or adjustments to the default
        !          1535: size of arrays within Lex itself for larger source programs.
        !          1536: These possibilities
        !          1537: are discussed below under ``Summary of Source Format,''
        !          1538: section 12.
        !          1539: .2C
        !          1540: .NH
        !          1541: Usage.
        !          1542: .PP
        !          1543: There are two steps in
        !          1544: compiling a Lex source program.
        !          1545: First, the Lex source must be turned into a generated program
        !          1546: in the host general purpose language.
        !          1547: Then this program must be compiled and loaded, usually with
        !          1548: a library of Lex subroutines.
        !          1549: The generated program
        !          1550: is on a file named
        !          1551: .I lex.yy.c .
        !          1552: The I/O library is defined in terms of the C standard
        !          1553: library [6].
        !          1554: .PP
        !          1555: The C programs generated by Lex are slightly different
        !          1556: on OS/370, because the
        !          1557: OS compiler is less powerful than the UNIX or GCOS compilers,
        !          1558: and does less at compile time.
        !          1559: C programs generated on GCOS and UNIX are the same.
        !          1560: .PP
        !          1561: .I
        !          1562: UNIX.
        !          1563: .R
        !          1564: The library is accessed by the loader flag
        !          1565: .I \-ll .
        !          1566: So an appropriate
        !          1567: set of commands is
        !          1568: .KS
        !          1569: .in 5
        !          1570: lex source
        !          1571: cc lex.yy.c \-ll
        !          1572: .in 0
        !          1573: .KE
        !          1574: The resulting program is placed on the usual file
        !          1575: .I
        !          1576: a.out
        !          1577: .R
        !          1578: for later execution.
        !          1579: To use Lex with Yacc see below.
        !          1580: Although the default Lex I/O routines use the C standard library,
        !          1581: the Lex automata themselves do not do so;
        !          1582: if private versions of
        !          1583: .I
        !          1584: input,
        !          1585: output
        !          1586: .R
        !          1587: and
        !          1588: .I unput
        !          1589: are given, the library can be avoided.
        !          1590: .PP
        !          1591: .2C
        !          1592: .NH
        !          1593: Lex and Yacc.
        !          1594: .PP
        !          1595: If you want to use Lex with Yacc, note that what Lex writes is a program
        !          1596: named
        !          1597: .I
        !          1598: yylex(),
        !          1599: .R
        !          1600: the name required by Yacc for its analyzer.
        !          1601: Normally, the default main program on the Lex library
        !          1602: calls this routine, but if Yacc is loaded, and its main
        !          1603: program is used, Yacc will call
        !          1604: .I
        !          1605: yylex().
        !          1606: .R
        !          1607: In this case each Lex rule should end with
        !          1608: .TS
        !          1609: center;
        !          1610: l.
        !          1611: return(token);
        !          1612: .TE
        !          1613: where the appropriate token value is returned.
        !          1614: An easy way to get access
        !          1615: to Yacc's names for tokens is to
        !          1616: compile the Lex output file as part of
        !          1617: the Yacc output file by placing the line
        !          1618: .TS
        !          1619: center;
        !          1620: l.
        !          1621: # include "lex.yy.c"
        !          1622: .TE
        !          1623: in the last section of Yacc input.
        !          1624: Supposing the grammar to be
        !          1625: named ``good'' and the lexical rules to be named ``better''
        !          1626: the UNIX command sequence can just be:
        !          1627: .TS
        !          1628: center;
        !          1629: l.
        !          1630: yacc good
        !          1631: lex better
        !          1632: cc y.tab.c \-ly \-ll
        !          1633: .TE
        !          1634: The Yacc library (\-ly) should be loaded before the Lex library,
        !          1635: to obtain a main program which invokes the Yacc parser.
        !          1636: The generations of Lex and Yacc programs can be done in
        !          1637: either order.
        !          1638: .2C
        !          1639: .NH
        !          1640: Examples.
        !          1641: .PP
        !          1642: As a trivial problem, consider copying an input file while
        !          1643: adding 3 to every positive number divisible by 7.
        !          1644: Here is a suitable Lex source program
        !          1645: .TS
        !          1646: center;
        !          1647: l l.
        !          1648: %%
        !          1649:        int k;
        !          1650: [0\-9]+        {
        !          1651:        k = atoi(yytext);
        !          1652:        if (k%7 == 0)
        !          1653:             printf("%d", k+3);
        !          1654:        else
        !          1655:             printf("%d",k);
        !          1656:        }
        !          1657: .TE
        !          1658: to do just that.
        !          1659: The rule [0\-9]+ recognizes strings of digits;
        !          1660: .I
        !          1661: atoi
        !          1662: .R
        !          1663: converts the digits to binary
        !          1664: and stores the result in
        !          1665: .ul
        !          1666: k.
        !          1667: The operator % (remainder) is used to check whether
        !          1668: .ul
        !          1669: k
        !          1670: is divisible by 7; if it is,
        !          1671: it is incremented by 3 as it is written out.
        !          1672: It may be objected that this program will alter such
        !          1673: input items as
        !          1674: .I 49.63
        !          1675: or
        !          1676: .I X7 .
        !          1677: Furthermore, it increments the absolute value
        !          1678: of all negative numbers divisible by 7.
        !          1679: To avoid this, just add a few more rules after the active one,
        !          1680: as here:
        !          1681: .TS
        !          1682: center;
        !          1683: l l.
        !          1684: %%
        !          1685:        int k;
        !          1686: \-?[0\-9]+     {
        !          1687:        k = atoi(yytext);
        !          1688:        printf("%d",
        !          1689:          k%7 == 0 ? k+3 : k);
        !          1690:        }
        !          1691: \-?[0\-9.]+    ECHO;
        !          1692: [A-Za-z][A-Za-z0-9]+   ECHO;
        !          1693: .TE
        !          1694: Numerical strings containing
        !          1695: a ``.'' or preceded by a letter will be picked up by
        !          1696: one of the last two rules, and not changed.
        !          1697: The
        !          1698: .I if\-else
        !          1699: has been replaced by
        !          1700: a C conditional expression to save space;
        !          1701: the form
        !          1702: .ul
        !          1703: a?b:c
        !          1704: means ``if
        !          1705: .I a
        !          1706: then
        !          1707: .I b
        !          1708: else
        !          1709: .I c ''.
        !          1710: .PP
        !          1711: For an example of statistics gathering, here
        !          1712: is a program which histograms the lengths
        !          1713: of words, where a word is defined as a string of letters.
        !          1714: .TS
        !          1715: center;
        !          1716: l l.
        !          1717:        int lengs[100];
        !          1718: %%
        !          1719: [a\-z]+        lengs[yyleng]++;
        !          1720: \&.    |
        !          1721: \en    ;
        !          1722: %%
        !          1723: .T&
        !          1724: l s.
        !          1725: yywrap()
        !          1726: {
        !          1727: int i;
        !          1728: printf("Length  No. words\en");
        !          1729: for(i=0; i<100; i++)
        !          1730:      if (lengs[i] > 0)
        !          1731:           printf("%5d%10d\en",i,lengs[i]);
        !          1732: return(1);
        !          1733: }
        !          1734: .TE
        !          1735: This program
        !          1736: accumulates the histogram, while producing no output.  At the end
        !          1737: of the input it prints the table.
        !          1738: The final statement
        !          1739: .I
        !          1740: return(1);
        !          1741: .R
        !          1742: indicates that Lex is to perform wrapup.  If
        !          1743: .I
        !          1744: yywrap
        !          1745: .R
        !          1746: returns zero (false)
        !          1747: it implies that further input is available
        !          1748: and the program is
        !          1749: to continue reading and processing.
        !          1750: To provide a
        !          1751: .I
        !          1752: yywrap
        !          1753: .R
        !          1754: that never
        !          1755: returns true causes an infinite loop.
        !          1756: .PP
        !          1757: As a larger example,
        !          1758: here are some parts of a program written by N. L. Schryer
        !          1759: to convert double precision Fortran to single precision Fortran.
        !          1760: Because Fortran does not distinguish upper and lower case letters,
        !          1761: this routine begins by defining a set of classes including
        !          1762: both cases of each letter:
        !          1763: .TS
        !          1764: center;
        !          1765: l l.
        !          1766: a      [aA]
        !          1767: b      [bB]
        !          1768: c      [cC]
        !          1769: \&...
        !          1770: z      [zZ]
        !          1771: .TE
        !          1772: An additional class recognizes white space:
        !          1773: .TS
        !          1774: center;
        !          1775: l l.
        !          1776: W      [ \et]\(**
        !          1777: .TE
        !          1778: The first rule changes
        !          1779: ``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.
        !          1780: .TS
        !          1781: center;
        !          1782: l.
        !          1783: {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
        !          1784:      printf(yytext[0]==\(fmd\(fm? "real" : "REAL");
        !          1785:      }
        !          1786: .TE
        !          1787: Care is taken throughout this program to preserve the case
        !          1788: (upper or lower)
        !          1789: of the original program.
        !          1790: The conditional operator is used to
        !          1791: select the proper form of the keyword.
        !          1792: The next rule copies continuation card indications to
        !          1793: avoid confusing them with constants:
        !          1794: .TS
        !          1795: center;
        !          1796: l l.
        !          1797: ^"     "[^ 0]  ECHO;
        !          1798: .TE
        !          1799: In the regular expression, the quotes surround the
        !          1800: blanks.
        !          1801: It is interpreted as
        !          1802: ``beginning of line, then five blanks, then
        !          1803: anything but blank or zero.'' 
        !          1804: Note the two different meanings of
        !          1805: .I ^ .
        !          1806: There follow some rules to change double precision
        !          1807: constants to ordinary floating constants.
        !          1808: .TS
        !          1809: center;
        !          1810: l.
        !          1811: [0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+     |
        !          1812: [0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+     |
        !          1813: "."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+     {
        !          1814:      /\(** convert constants \(**/
        !          1815:      for(p=yytext; \(**p != 0; p++)
        !          1816:           {
        !          1817:           if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm)
        !          1818:                \(**p=+ \(fme\(fm\- \(fmd\(fm;
        !          1819:           ECHO;
        !          1820:           }
        !          1821: .TE
        !          1822: After the floating point constant is recognized, it is
        !          1823: scanned by the
        !          1824: .ul
        !          1825: for
        !          1826: loop
        !          1827: to find the letter
        !          1828: .I d
        !          1829: or
        !          1830: .I D .
        !          1831: The program than adds
        !          1832: .c
        !          1833: .I \(fme\(fm\-\(fmd\(fm ,
        !          1834: which converts
        !          1835: it to the next letter of the alphabet.
        !          1836: The modified constant, now single-precision,
        !          1837: is written out again.
        !          1838: There follow a series of names which must be respelled to remove
        !          1839: their initial \fId\fR.
        !          1840: By using the
        !          1841: array
        !          1842: .I
        !          1843: yytext
        !          1844: .R
        !          1845: the same action suffices for all the names (only a sample of
        !          1846: a rather long list is given here).
        !          1847: .TS
        !          1848: center;
        !          1849: l l.
        !          1850: {d}{s}{i}{n}   |
        !          1851: {d}{c}{o}{s}   |
        !          1852: {d}{s}{q}{r}{t}        |
        !          1853: {d}{a}{t}{a}{n}        |
        !          1854: \&...
        !          1855: {d}{f}{l}{o}{a}{t}     printf("%s",yytext+1);
        !          1856: .TE
        !          1857: Another list of names must have initial \fId\fR changed to initial \fIa\fR:
        !          1858: .TS
        !          1859: center;
        !          1860: l l.
        !          1861: {d}{l}{o}{g}   |
        !          1862: {d}{l}{o}{g}10 |
        !          1863: {d}{m}{i}{n}1  |
        !          1864: {d}{m}{a}{x}1  {
        !          1865:        yytext[0] =+ \(fma\(fm \- \(fmd\(fm;
        !          1866:        ECHO;
        !          1867:        }
        !          1868: .TE
        !          1869: And one routine
        !          1870: must have initial \fId\fR changed to initial \fIr\fR:
        !          1871: .TS
        !          1872: center;
        !          1873: l l.
        !          1874: {d}1{m}{a}{c}{h}       {yytext[0] =+ \(fmr\(fm  \- \(fmd\(fm;
        !          1875:                ECHO;
        !          1876:                }
        !          1877: .TE
        !          1878: To avoid such names as \fIdsinx\fR being detected as instances
        !          1879: of \fIdsin\fR, some final rules pick up longer words as identifiers
        !          1880: and copy some surviving characters:
        !          1881: .TS
        !          1882: center;
        !          1883: l l.
        !          1884: [A\-Za\-z][A\-Za\-z0\-9]\(**   |
        !          1885: [0\-9]+        |
        !          1886: \en    |
        !          1887: \&.    ECHO;
        !          1888: .TE
        !          1889: Note that this program is not complete; it
        !          1890: does not deal with the spacing problems in Fortran or
        !          1891: with the use of keywords as identifiers.
        !          1892: .br
        !          1893: .2C
        !          1894: .NH
        !          1895: Left Context Sensitivity.
        !          1896: .PP
        !          1897: Sometimes
        !          1898: it is desirable to have several sets of lexical rules
        !          1899: to be applied at different times in the input.
        !          1900: For example, a compiler preprocessor might distinguish
        !          1901: preprocessor statements and analyze them differently
        !          1902: from ordinary statements.
        !          1903: This requires
        !          1904: sensitivity
        !          1905: to prior context, and there are several ways of handling
        !          1906: such problems.
        !          1907: The \fI^\fR operator, for example, is a prior context operator,
        !          1908: recognizing immediately preceding left context just as \fI$\fR recognizes
        !          1909: immediately following right context.
        !          1910: Adjacent left context could be extended, to produce a facility similar to
        !          1911: that for adjacent right context, but it is unlikely
        !          1912: to be as useful, since often the relevant left context
        !          1913: appeared some time earlier, such as at the beginning of a line.
        !          1914: .PP
        !          1915: This section describes three means of dealing
        !          1916: with different environments: a simple use of flags,
        !          1917: when only a few rules change from one environment to another,
        !          1918: the use of
        !          1919: .I
        !          1920: start conditions
        !          1921: .R
        !          1922: on rules,
        !          1923: and the possibility of making multiple lexical analyzers all run
        !          1924: together.
        !          1925: In each case, there are rules which recognize the need to change the
        !          1926: environment in which the
        !          1927: following input text is analyzed, and set some parameter
        !          1928: to reflect the change.  This may be a flag explicitly tested by
        !          1929: the user's action code; such a flag is the simplest way of dealing
        !          1930: with the problem, since Lex is not involved at all.
        !          1931: It may be more convenient,
        !          1932: however,
        !          1933: to have Lex remember the flags as initial conditions on the rules.
        !          1934: Any rule may be associated with a start condition.  It will only
        !          1935: be recognized when Lex is in
        !          1936: that start condition.
        !          1937: The current start condition may be changed at any time.
        !          1938: Finally, if the sets of rules for the different environments
        !          1939: are very dissimilar,
        !          1940: clarity may be best achieved by writing several distinct lexical
        !          1941: analyzers, and switching from one to another as desired.
        !          1942: .PP
        !          1943: Consider the following problem: copy the input to the output,
        !          1944: changing the word \fImagic\fR to \fIfirst\fR on every line which began
        !          1945: with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line
        !          1946: which began with the letter \fIb\fR, and changing
        !          1947: \fImagic\fR to \fIthird\fR on every line which began
        !          1948: with the letter \fIc\fR.  All other words and all other lines
        !          1949: are left unchanged.
        !          1950: .PP
        !          1951: These rules are so simple that the easiest way
        !          1952: to do this job is with a flag:
        !          1953: .TS
        !          1954: center;
        !          1955: l l.
        !          1956:        int flag;
        !          1957: %%
        !          1958: ^a     {flag = \(fma\(fm; ECHO;}
        !          1959: ^b     {flag = \(fmb\(fm; ECHO;}
        !          1960: ^c     {flag = \(fmc\(fm; ECHO;}
        !          1961: \en    {flag =  0 ; ECHO;}
        !          1962: magic  {
        !          1963:        switch (flag)
        !          1964:        {
        !          1965:        case \(fma\(fm: printf("first"); break;
        !          1966:        case \(fmb\(fm: printf("second"); break;
        !          1967:      case \(fmc\(fm: printf("third"); break;
        !          1968:        default: ECHO; break;
        !          1969:        }
        !          1970:        }
        !          1971: .TE
        !          1972: should be adequate.
        !          1973: .PP
        !          1974: To handle the same problem with start conditions, each
        !          1975: start condition must be introduced to Lex in the definitions section
        !          1976: with a line reading
        !          1977: .TS
        !          1978: center;
        !          1979: l l.
        !          1980: %Start name1 name2 ...
        !          1981: .TE
        !          1982: where the conditions may be named in any order.
        !          1983: The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR.
        !          1984: The conditions may be referenced at the
        !          1985: head of a rule with the <> brackets:
        !          1986: .TS
        !          1987: center;
        !          1988: l.
        !          1989: <name1>expression
        !          1990: .TE
        !          1991: is a rule which is only recognized when Lex is in the
        !          1992: start condition \fIname1\fR.
        !          1993: To enter a start condition,
        !          1994: execute the action statement
        !          1995: .TS
        !          1996: center;
        !          1997: l.
        !          1998: BEGIN name1;
        !          1999: .TE
        !          2000: which changes the start condition to \fIname1\fR.
        !          2001: To resume the normal state,
        !          2002: .TS
        !          2003: center;
        !          2004: l.
        !          2005: BEGIN 0;
        !          2006: .TE
        !          2007: resets the initial condition
        !          2008: of the Lex automaton interpreter.
        !          2009: A rule may be active in several
        !          2010: start conditions:
        !          2011: .TS
        !          2012: center;
        !          2013: l.
        !          2014: <name1,name2,name3>
        !          2015: .TE
        !          2016: is a legal prefix.  Any rule not beginning with the
        !          2017: <> prefix operator is always active.
        !          2018: .PP
        !          2019: The same example as before can be written:
        !          2020: .TS
        !          2021: center;
        !          2022: l l.
        !          2023: %START AA BB CC
        !          2024: %%
        !          2025: ^a     {ECHO; BEGIN AA;}
        !          2026: ^b     {ECHO; BEGIN BB;}
        !          2027: ^c     {ECHO; BEGIN CC;}
        !          2028: \en    {ECHO; BEGIN 0;}
        !          2029: <AA>magic      printf("first");
        !          2030: <BB>magic      printf("second");
        !          2031: <CC>magic      printf("third");
        !          2032: .TE
        !          2033: where the logic is exactly the same as in the previous
        !          2034: method of handling the problem, but Lex does the work
        !          2035: rather than the user's code.
        !          2036: .2C
        !          2037: .NH
        !          2038: Character Set.
        !          2039: .PP
        !          2040: The programs generated by Lex handle
        !          2041: character I/O only through the routines
        !          2042: .I
        !          2043: input,
        !          2044: output,
        !          2045: .R
        !          2046: and
        !          2047: .I
        !          2048: unput.
        !          2049: .R
        !          2050: Thus the character representation
        !          2051: provided in these routines
        !          2052: is accepted by Lex and employed to return
        !          2053: values in
        !          2054: .I
        !          2055: yytext.
        !          2056: .R
        !          2057: For internal use
        !          2058: a character is represented as a small integer
        !          2059: which, if the standard library is used,
        !          2060: has a value equal to the integer value of the bit
        !          2061: pattern representing the character on the host computer.
        !          2062: Normally, the letter
        !          2063: .I a
        !          2064: is represented as the same form as the character constant
        !          2065: .I \(fma\(fm .
        !          2066: If this interpretation is changed, by providing I/O
        !          2067: routines which translate the characters,
        !          2068: Lex must be told about
        !          2069: it, by giving a translation table.
        !          2070: This table must be in the definitions section,
        !          2071: and must be bracketed by lines containing  only
        !          2072: ``%T''.
        !          2073: The table contains lines of the form
        !          2074: .TS
        !          2075: center;
        !          2076: l.
        !          2077: {integer} {character string}
        !          2078: .TE
        !          2079: which indicate the value associated with each character.
        !          2080: Thus the next example
        !          2081: .GS 2
        !          2082: .TS
        !          2083: center;
        !          2084: l l.
        !          2085: %T
        !          2086:  1     Aa
        !          2087:  2     Bb
        !          2088: \&...
        !          2089: 26     Zz
        !          2090: 27     \en
        !          2091: 28     +
        !          2092: 29     \-
        !          2093: 30     0
        !          2094: 31     1
        !          2095: \&...
        !          2096: 39     9
        !          2097: %T
        !          2098: .TE
        !          2099: .sp
        !          2100: .ce 1
        !          2101: Sample character table.
        !          2102: .GE
        !          2103: maps the lower and upper case letters together into the integers 1 through 26,
        !          2104: newline into 27, + and \- into 28 and 29, and the
        !          2105: digits into 30 through 39.
        !          2106: Note the escape for newline.
        !          2107: If a table is supplied, every character that is to appear either
        !          2108: in the rules or in any valid input must be included
        !          2109: in the table.
        !          2110: No character
        !          2111: may be assigned the number 0, and no character may be
        !          2112: assigned a bigger number than the size of the hardware character set.
        !          2113: .2C
        !          2114: .NH
        !          2115: Summary of Source Format.
        !          2116: .PP
        !          2117: The general form of a Lex source file is:
        !          2118: .TS
        !          2119: center;
        !          2120: l.
        !          2121: {definitions}
        !          2122: %%
        !          2123: {rules}
        !          2124: %%
        !          2125: {user subroutines}
        !          2126: .TE
        !          2127: The definitions section contains
        !          2128: a combination of
        !          2129: .IP 1)
        !          2130: Definitions, in the form ``name space translation''.
        !          2131: .IP 2)
        !          2132: Included code, in the form ``space code''.
        !          2133: .IP 3)
        !          2134: Included code, in the form
        !          2135: .TS
        !          2136: center;
        !          2137: l.
        !          2138: %{
        !          2139: code
        !          2140: %}
        !          2141: .TE
        !          2142: .ns
        !          2143: .IP 4)
        !          2144: Start conditions, given in the form
        !          2145: .TS
        !          2146: center;
        !          2147: l.
        !          2148: %S name1 name2 ...
        !          2149: .TE
        !          2150: .ns
        !          2151: .IP 5)
        !          2152: Character set tables, in the form
        !          2153: .TS
        !          2154: center;
        !          2155: l.
        !          2156: %T
        !          2157: number space character-string
        !          2158: \&...
        !          2159: %T
        !          2160: .TE
        !          2161: .ns
        !          2162: .IP 6)
        !          2163: Changes to internal array sizes, in the form
        !          2164: .TS
        !          2165: center;
        !          2166: l.
        !          2167: %\fIx\fR\0\0\fInnn\fR
        !          2168: .TE
        !          2169: where \fInnn\fR is a decimal integer representing an array size
        !          2170: and \fIx\fR selects the parameter as follows:
        !          2171: .TS
        !          2172: center;
        !          2173: c c
        !          2174: c l.
        !          2175: Letter Parameter
        !          2176: p      positions
        !          2177: n      states
        !          2178: e      tree nodes
        !          2179: a      transitions
        !          2180: k      packed character classes
        !          2181: o      output array size
        !          2182: .TE
        !          2183: .LP
        !          2184: Lines in the rules section have the form ``expression  action''
        !          2185: where the action may be continued on succeeding
        !          2186: lines by using braces to delimit it.
        !          2187: .PP
        !          2188: Regular expressions in Lex use the following
        !          2189: operators:
        !          2190: .br
        !          2191: .TS
        !          2192: center;
        !          2193: l l.
        !          2194: x      the character "x"
        !          2195: "x"    an "x", even if x is an operator.
        !          2196: \ex    an "x", even if x is an operator.
        !          2197: [xy]   the character x or y.
        !          2198: [x\-z] the characters x, y or z.
        !          2199: [^x]   any character but x.
        !          2200: \&.    any character but newline.
        !          2201: ^x     an x at the beginning of a line.
        !          2202: <y>x   an x when Lex is in start condition y.
        !          2203: x$     an x at the end of a line.
        !          2204: x?     an optional x.
        !          2205: x\(**  0,1,2, ... instances of x.
        !          2206: x+     1,2,3, ... instances of x.
        !          2207: x|y    an x or a y.
        !          2208: (x)    an x.
        !          2209: x/y    an x but only if followed by y.
        !          2210: {xx}   the translation of xx from the
        !          2211:        definitions section.
        !          2212: x{m,n} \fIm\fR through \fIn\fR occurrences of x
        !          2213: .TE
        !          2214: .NH
        !          2215: Caveats and Bugs.
        !          2216: .PP
        !          2217: There are pathological expressions which
        !          2218: produce exponential growth of the tables when
        !          2219: converted to deterministic machines;
        !          2220: fortunately, they are rare.
        !          2221: .PP
        !          2222: REJECT does not rescan the input; instead it remembers the results of the previous
        !          2223: scan.  This means that if a rule with trailing context is found, and
        !          2224: REJECT executed, the user
        !          2225: must not have used
        !          2226: .ul
        !          2227: unput
        !          2228: to change the characters forthcoming
        !          2229: from the input stream.
        !          2230: This is the only restriction on the user's ability to manipulate
        !          2231: the not-yet-processed input.
        !          2232: .PP
        !          2233: .2C
        !          2234: .NH
        !          2235: Acknowledgments.
        !          2236: .PP
        !          2237: As should
        !          2238: be obvious from the above, the outside of Lex
        !          2239: is patterned
        !          2240: on Yacc and the inside on Aho's string matching routines.
        !          2241: Therefore, both S. C. Johnson and A. V. Aho
        !          2242: are really originators
        !          2243: of much of Lex,
        !          2244: as well as debuggers of it.
        !          2245: Many thanks are due to both.
        !          2246: .PP
        !          2247: The code of the current version of Lex was designed, written,
        !          2248: and debugged by Eric Schmidt.
        !          2249: .SG MH-1274-MEL-unix
        !          2250: .sp 1
        !          2251: .2C
        !          2252: .NH
        !          2253: References.
        !          2254: .SP 1v
        !          2255: .IP 1.
        !          2256: B. W. Kernighan and D. M. Ritchie,
        !          2257: .I
        !          2258: The C Programming Language,
        !          2259: .R
        !          2260: Prentice-Hall, N. J. (1978).
        !          2261: .IP 2.
        !          2262: B. W. Kernighan,
        !          2263: .I
        !          2264: Ratfor: A Preprocessor for a Rational Fortran,
        !          2265: .R
        !          2266: Software \- Practice and Experience,
        !          2267: \fB5\fR, pp. 395-496 (1975).
        !          2268: .IP 3.
        !          2269: S. C. Johnson,
        !          2270: .I
        !          2271: Yacc: Yet Another Compiler Compiler,
        !          2272: .R
        !          2273: Computing Science Technical Report No. 32,
        !          2274: 1975,
        !          2275: .MH
        !          2276: .if \n(tm (also TM 75-1273-6)
        !          2277: .IP 4.
        !          2278: A. V. Aho and M. J. Corasick,
        !          2279: .I
        !          2280: Efficient String Matching: An Aid to Bibliographic Search,
        !          2281: .R
        !          2282: Comm. ACM
        !          2283: .B
        !          2284: 18,
        !          2285: .R
        !          2286: 333-340 (1975).
        !          2287: .IP 5.
        !          2288: B. W. Kernighan, D. M. Ritchie and K. L. Thompson,
        !          2289: .I
        !          2290: QED Text Editor,
        !          2291: .R
        !          2292: Computing Science Technical Report No. 5,
        !          2293: 1972,
        !          2294: .MH
        !          2295: .IP 6.
        !          2296: D. M. Ritchie,
        !          2297: private communication.
        !          2298: See also
        !          2299: M. E. Lesk,
        !          2300: .I
        !          2301: The Portable C Library,
        !          2302: .R
        !          2303: Computing Science Technical Report No. 31,
        !          2304: .MH
        !          2305: .if \n(tm (also TM 75-1274-11)

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.