Annotation of 43BSDReno/share/doc/ps1/16.lex/lex.ms, revision 1.1.1.1

1.1       root        1: .\"    @(#)lex.ms      6.2 (Berkeley) 5/1/87
                      2: .\"
                      3: .EH 'PS1:16-%''Lex \- A Lexical Analyzer Generator'
                      4: .OH 'Lex \- A Lexical Analyzer Generator''PS1:16-%'
                      5: .hc ~
                      6: .bd I 2
                      7: .de TS
                      8: .br
                      9: .nf
                     10: .SP 1v
                     11: .ul 0
                     12: ..
                     13: .de TE
                     14: .SP 1v
                     15: .fi
                     16: ..
                     17: .\".de PT
                     18: .\".if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''
                     19: .\".if \\n%>1 'sp
                     20: .\"..
                     21: .ND July 21, 1975
                     22: .\".RP
                     23: .\".TM 75-1274-15 39199 39199-11
                     24: .TL
                     25: Lex \- A Lexical Analyzer ~Generator~
                     26: .AU ``MH 2C-569'' 6377
                     27: M. E. Lesk and E. Schmidt
                     28: .AI
                     29: .MH
                     30: .AB
                     31: .sp
                     32: .bd I 2
                     33: .\".nr PS 8
                     34: .\".nr VS 9
                     35: .\".ps 8
                     36: .\".vs 9p
                     37: Lex helps write programs whose control flow
                     38: is directed by instances of regular
                     39: expressions in the input stream.
                     40: It is well suited for editor-script type transformations and
                     41: for segmenting input in preparation for
                     42: a parsing routine.
                     43: .PP
                     44: Lex source is a table of regular expressions and corresponding program fragments.
                     45: The table is translated to a program
                     46: which reads an input stream, copying it to an output stream
                     47: and partitioning the input
                     48: into strings which match the given expressions.
                     49: As each such string is recognized the corresponding
                     50: program fragment is executed.
                     51: The recognition of the expressions
                     52: is performed by a deterministic finite automaton
                     53: generated by Lex.
                     54: The program fragments written by the user are executed in the order in which the
                     55: corresponding regular expressions occur in the input stream.
                     56: .if n .if \n(tm .ig
                     57: .PP
                     58: The lexical analysis
                     59: programs written with Lex accept ambiguous specifications
                     60: and choose the longest
                     61: match possible at each input point.
                     62: If necessary, substantial look~ahead
                     63: is performed on the input, but the
                     64: input stream will be backed up to the
                     65: end of the current partition, so that the user
                     66: has general freedom to manipulate it.
                     67: .PP
                     68: Lex can generate analyzers in either C or Ratfor, a language
                     69: which can be translated automatically to portable Fortran.
                     70: It is available on the PDP-11 UNIX, Honeywell GCOS,
                     71: and IBM OS systems.
                     72: This manual, however, will only discuss generating analyzers
                     73: in C on the UNIX system, which is the only supported
                     74: form of Lex under UNIX Version 7.
                     75: Lex is designed to simplify
                     76: interfacing with Yacc, for those
                     77: with access to this compiler-compiler system.
                     78: ..
                     79: .\".nr PS 9
                     80: .\".nr VS 11
                     81: .AE
                     82: .2C
                     83: .NH
                     84: Introduction.
                     85: .PP
                     86: Lex is a program generator designed for
                     87: lexical processing of character input streams.
                     88: It accepts a high-level, problem oriented specification
                     89: for character string matching,
                     90: and
                     91: produces a program in a general purpose language which recognizes
                     92: regular expressions.
                     93: The regular expressions are specified by the user in the
                     94: source specifications given to Lex.
                     95: The Lex written code recognizes these expressions
                     96: in an input stream and partitions the input stream into
                     97: strings matching the expressions.  At the bound~aries
                     98: between strings
                     99: program sections
                    100: provided by the user are executed.
                    101: The Lex source file associates the regular expressions and the
                    102: program fragments.
                    103: As each expression appears in the input to the program written by Lex,
                    104: the corresponding fragment is executed.
                    105: .PP
                    106: .de MH
                    107: Bell Laboratories, Murray Hill, NJ 07974.
                    108: ..
                    109: The user supplies the additional code
                    110: beyond expression matching
                    111: needed to complete his tasks, possibly
                    112: including code written by other generators.
                    113: The program that recognizes the expressions is generated in the
                    114: general purpose programming language employed for the
                    115: user's program fragments.
                    116: Thus, a high level expression
                    117: language is provided to write the string expressions to be
                    118: matched while the user's freedom to write actions
                    119: is unimpaired.
                    120: This avoids forcing the user who wishes to use a string manipulation
                    121: language for input analysis to write processing programs in the same
                    122: and often inappropriate string handling language.
                    123: .PP
                    124: Lex is not a complete language, but rather a generator representing
                    125: a new language feature which can be added to
                    126: different programming languages, called ``host languages.'' 
                    127: Just as general purpose languages
                    128: can produce code to run on different computer hardware,
                    129: Lex can write code in different host languages.
                    130: The host language is used for the output code generated by Lex
                    131: and also for the program fragments added by the user.
                    132: Compatible run-time libraries for the different host languages
                    133: are also provided.
                    134: This makes Lex adaptable to different environments and
                    135: different users.
                    136: Each application
                    137: may be directed to the combination of hardware and host language appropriate
                    138: to the task, the user's background, and the properties of local
                    139: implementations.
                    140: At present, the only supported host language is C,
                    141: although Fortran (in the form of Ratfor [2] has been available
                    142: in the past.
                    143: Lex itself exists on UNIX, GCOS, and OS/370; but the
                    144: code generated by Lex may be taken anywhere the appropriate
                    145: compilers exist.
                    146: .PP
                    147: Lex turns the user's expressions and actions
                    148: (called
                    149: .ul
                    150: source
                    151: in this memo) into the host general-purpose language;
                    152: the generated program is named
                    153: .ul
                    154: yylex.
                    155: The
                    156: .ul
                    157: yylex
                    158: program
                    159: will recognize expressions
                    160: in a stream
                    161: (called
                    162: .ul
                    163: input
                    164: in this memo)
                    165: and perform the specified actions for each expression as it is detected.
                    166: See Figure 1.
                    167: .GS
                    168: .TS
                    169: center;
                    170: l _ r
                    171: l|c|r
                    172: l _ r
                    173: l _ r
                    174: l|c|r
                    175: l _ r
                    176: c s s
                    177: c s s.
                    178: 
                    179: Source \(->    Lex     \(-> yylex
                    180: 
                    181: .sp 2
                    182: 
                    183: Input \(->     yylex   \(-> Output
                    184: 
                    185: .sp
                    186: An overview of Lex
                    187: Figure 1
                    188: .TE
                    189: .GE
                    190: .PP
                    191: For a trivial example, consider a program to delete
                    192: from the input
                    193: all blanks or tabs at the ends of lines.
                    194: .TS
                    195: center;
                    196: l l.
                    197: %%
                    198: [ \et]+$       ;
                    199: .TE
                    200: is all that is required.
                    201: The program
                    202: contains a %% delimiter to mark the beginning of the rules, and
                    203: one rule.
                    204: This rule contains a regular expression
                    205: which matches one or more
                    206: instances of the characters blank or tab
                    207: (written \et for visibility, in accordance with the C language convention)
                    208: just prior to the end of a line.
                    209: The brackets indicate the character
                    210: class made of blank and tab; the + indicates ``one or more ...'';
                    211: and the $ indicates ``end of line,'' as in QED.
                    212: No action is specified,
                    213: so the program generated by Lex (yylex) will ignore these characters.
                    214: Everything else will be copied.
                    215: To change any remaining
                    216: string of blanks or tabs to a single blank,
                    217: add another rule:
                    218: .TS
                    219: center;
                    220: l l.
                    221: %%
                    222: [ \et]+$       ;
                    223: [ \et]+        printf(" ");
                    224: .TE
                    225: The finite automaton generated for this
                    226: source will scan for both rules at once,
                    227: observing at
                    228: the termination of the string of blanks or tabs
                    229: whether or not there is a newline character, and executing
                    230: the desired rule action.
                    231: The first rule matches all strings of blanks or tabs
                    232: at the end of lines, and the second
                    233: rule all remaining strings of blanks or tabs.
                    234: .PP
                    235: Lex can be used alone for simple transformations, or
                    236: for analysis and statistics gathering on a lexical level.
                    237: Lex can also be used with a parser generator
                    238: to perform the lexical analysis phase; it is particularly
                    239: easy to interface Lex and Yacc [3].
                    240: Lex programs recognize only regular expressions;
                    241: Yacc writes parsers that accept a large class of context free grammars,
                    242: but require a lower level analyzer to recognize input tokens.
                    243: Thus, a combination of Lex and Yacc is often appropriate.
                    244: When used as a preprocessor for a later parser generator,
                    245: Lex is used to partition the input stream,
                    246: and the parser generator assigns structure to
                    247: the resulting pieces.
                    248: The flow of control
                    249: in such a case (which might be the first half of a compiler,
                    250: for example) is shown in Figure 2.
                    251: Additional programs,
                    252: written by other generators
                    253: or by hand, can
                    254: be added easily to programs written by Lex.
                    255: .BS 2
                    256: .ps 9
                    257: .vs 11
                    258: .TS
                    259: center;
                    260: l c c c l
                    261: l c c c l
                    262: l c c c l
                    263: l _ c _ l
                    264: l|c|c|c|l
                    265: l _ c _ l
                    266: l c c c l
                    267: l _ c _ l
                    268: l|c|c|c|l
                    269: l _ c _ l
                    270: l c s s l
                    271: l c s s l.
                    272:        lexical         grammar
                    273:        rules           rules
                    274:        \(da            \(da
                    275: 
                    276:        Lex             Yacc
                    277: 
                    278:        \(da            \(da
                    279: 
                    280: Input \(->     yylex   \(->    yyparse \(-> Parsed input
                    281: 
                    282: .sp
                    283:        Lex with Yacc
                    284:        Figure 2
                    285: .TE
                    286: .ps 10
                    287: .vs 12
                    288: .BE
                    289: Yacc users
                    290: will realize that the name
                    291: .ul
                    292: yylex
                    293: is what Yacc expects its lexical analyzer to be named,
                    294: so that the use of this name by Lex simplifies
                    295: interfacing.
                    296: .PP
                    297: Lex generates a deterministic finite automaton from the regular expressions
                    298: in the source [4].
                    299: The automaton is interpreted, rather than compiled, in order
                    300: to save space.
                    301: The result is still a fast analyzer.
                    302: In particular, the time taken by a Lex program
                    303: to recognize and partition an input stream is
                    304: proportional to the length of the input.
                    305: The number of Lex rules or
                    306: the complexity of the rules is
                    307: not important in determining speed,
                    308: unless rules which include
                    309: forward context require a significant amount of re~scanning.
                    310: What does increase with the number and complexity of rules
                    311: is the size of the finite
                    312: automaton, and therefore the size of the program
                    313: generated by Lex.
                    314: .PP
                    315: In the program written by Lex, the user's fragments
                    316: (representing the
                    317: .ul
                    318: actions
                    319: to be performed as each regular expression
                    320: is found)
                    321: are gathered
                    322: as cases of a switch.
                    323: The automaton interpreter directs the control flow.
                    324: Opportunity is provided for the user to insert either
                    325: declarations or additional statements in the routine containing
                    326: the actions, or to
                    327: add subroutines outside this action routine.
                    328: .PP
                    329: Lex is not limited to source which can
                    330: be interpreted on the basis of one character
                    331: look~ahead.
                    332: For example,
                    333: if there are two rules, one looking for
                    334: .I ab
                    335: and another for
                    336: .I abcdefg ,
                    337: and the input stream is
                    338: .I abcdefh ,
                    339: Lex will recognize
                    340: .I ab
                    341: and leave
                    342: the input pointer just before
                    343: .I "cd. . ."
                    344: Such backup is more costly
                    345: than the processing of simpler languages.
                    346: .2C
                    347: .NH
                    348: Lex Source.
                    349: .PP
                    350: The general format of Lex source is:
                    351: .TS
                    352: center;
                    353: l.
                    354: {definitions}
                    355: %%
                    356: {rules}
                    357: %%
                    358: {user subroutines}
                    359: .TE
                    360: where the definitions and the user subroutines
                    361: are often omitted.
                    362: The second
                    363: .I %%
                    364: is optional, but the first is required
                    365: to mark the beginning of the rules.
                    366: The absolute minimum Lex program is thus
                    367: .TS
                    368: center;
                    369: l.
                    370: %%
                    371: .TE
                    372: (no definitions, no rules) which translates into a program
                    373: which copies the input to the output unchanged.
                    374: .PP
                    375: In the outline of Lex programs shown above, the
                    376: .I
                    377: rules
                    378: .R
                    379: represent the user's control
                    380: decisions; they are a table, in which the left column
                    381: contains
                    382: .I
                    383: regular expressions
                    384: .R
                    385: (see section 3)
                    386: and the right column contains
                    387: .I
                    388: actions,
                    389: .R
                    390: program fragments to be executed when the expressions
                    391: are recognized.
                    392: Thus an individual rule might appear
                    393: .TS
                    394: center;
                    395: l l.
                    396: integer        printf("found keyword INT");
                    397: .TE
                    398: to look for the string
                    399: .I integer
                    400: in the input stream and
                    401: print the message ``found keyword INT'' whenever it appears.
                    402: In this example the host procedural language is C and
                    403: the C library function
                    404: .I
                    405: printf
                    406: .R
                    407: is used to print the string.
                    408: The end
                    409: of the expression is indicated by the first blank or tab character.
                    410: If the action is merely a single C expression,
                    411: it can just be given on the right side of the line; if it is
                    412: compound, or takes more than a line, it should be enclosed in
                    413: braces.
                    414: As a slightly more useful example, suppose it is desired to
                    415: change a number of words from British to American spelling.
                    416: Lex rules such as
                    417: .TS
                    418: center;
                    419: l l.
                    420: colour printf("color");
                    421: mechanise      printf("mechanize");
                    422: petrol printf("gas");
                    423: .TE
                    424: would be a start.  These rules are not quite enough,
                    425: since
                    426: the word
                    427: .I petroleum
                    428: would become
                    429: .I gaseum ;
                    430: a way of dealing
                    431: with this will be described later.
                    432: .2C
                    433: .NH
                    434: Lex Regular Expressions.
                    435: .PP
                    436: The definitions of regular expressions are very similar to those
                    437: in QED [5].
                    438: A regular
                    439: expression specifies a set of strings to be matched.
                    440: It contains text characters (which match the corresponding
                    441: characters in the strings being compared)
                    442: and operator characters (which specify
                    443: repetitions, choices, and other features).
                    444: The letters of the alphabet and the digits are
                    445: always text characters; thus the regular expression
                    446: .TS
                    447: center;
                    448: l l.
                    449: integer
                    450: .TE
                    451: matches the string
                    452: .ul
                    453: integer
                    454: wherever it appears
                    455: and the expression
                    456: .TS
                    457: center;
                    458: l.
                    459: a57D
                    460: .TE
                    461: looks for the string
                    462: .ul
                    463: a57D.
                    464: .PP
                    465: .I
                    466: Operators.
                    467: .R
                    468: The operator characters are
                    469: .TS
                    470: center;
                    471: l.
                    472: " \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < >
                    473: .TE
                    474: and if they are to be used as text characters, an escape
                    475: should be used.
                    476: The quotation mark operator (")
                    477: indicates that whatever is contained between a pair of quotes
                    478: is to be taken as text characters.
                    479: Thus
                    480: .TS
                    481: center;
                    482: l.
                    483: xyz"++"
                    484: .TE
                    485: matches the string
                    486: .I xyz++
                    487: when it appears.  Note that a part of a string may be quoted.
                    488: It is harmless but unnecessary to quote an ordinary
                    489: text character; the expression
                    490: .TS
                    491: center;
                    492: l.
                    493: "xyz++"
                    494: .TE
                    495: is the same as the one above.
                    496: Thus by quoting every non-alphanumeric character
                    497: being used as a text character, the user can avoid remembering
                    498: the list above of current
                    499: operator characters, and is safe should further extensions to Lex
                    500: lengthen the list.
                    501: .PP
                    502: An operator character may also be turned into a text character
                    503: by preceding it with \e as in
                    504: .TS
                    505: center;
                    506: l.
                    507: xyz\e+\e+
                    508: .TE
                    509: which
                    510: is another, less readable, equivalent of the above expressions.
                    511: Another use of the quoting mechanism is to get a blank into
                    512: an expression; normally, as explained above, blanks or tabs end
                    513: a rule.
                    514: Any blank character not contained within [\|] (see below) must
                    515: be quoted.
                    516: Several normal C escapes with \e
                    517: are recognized: \en is newline, \et is tab, and \eb is backspace.
                    518: To enter \e itself, use \e\e.
                    519: Since newline is illegal in an expression, \en must be used;
                    520: it is not
                    521: required to escape tab and backspace.
                    522: Every character but blank, tab, newline and the list above is always
                    523: a text character.
                    524: .PP
                    525: .I
                    526: Character classes.
                    527: .R
                    528: Classes of characters can be specified using the operator pair [\|].
                    529: The construction
                    530: .I [abc]
                    531: matches a
                    532: single character, which may be
                    533: .I a ,
                    534: .I b ,
                    535: or
                    536: .I c .
                    537: Within square brackets,
                    538: most operator meanings are ignored.
                    539: Only three characters are special:
                    540: these are \e \(mi and ^.  The \(mi character
                    541: indicates ranges.  For example,
                    542: .TS
                    543: center;
                    544: l.
                    545: [a\(miz0\(mi9<>_]
                    546: .TE
                    547: indicates the character class containing all the lower case letters,
                    548: the digits,
                    549: the angle brackets, and underline.
                    550: Ranges may be given in either order.
                    551: Using \(mi between any pair of characters which are
                    552: not both upper case letters, both lower case letters, or both digits
                    553: is implementation dependent and will get a warning message.
                    554: (E.g., [0\-z] in ASCII is many more characters
                    555: than it is in EBCDIC).
                    556: If it is desired to include the
                    557: character \(mi in a character class, it should be first or
                    558: last; thus
                    559: .TS
                    560: center;
                    561: l.
                    562: [\(mi+0\(mi9]
                    563: .TE
                    564: matches all the digits and the two signs.
                    565: .PP
                    566: In character classes,
                    567: the ^ operator must appear as the first character
                    568: after the left bracket; it indicates that the resulting string
                    569: is to be complemented with respect to the computer character set.
                    570: Thus
                    571: .TS
                    572: center;
                    573: l.
                    574: [^abc]
                    575: .TE
                    576: matches all characters except a, b, or c, including
                    577: all special or control characters; or
                    578: .TS
                    579: center;
                    580: l.
                    581: [^a\-zA\-Z]
                    582: .TE
                    583: is any character which is not a letter.
                    584: The \e character provides the usual escapes within
                    585: character class brackets.
                    586: .PP
                    587: .I
                    588: Arbitrary character.
                    589: .R
                    590: To match almost any character, the operator character
                    591: .TS
                    592: center;
                    593: l.
                    594: \&.
                    595: .TE
                    596: is the class of all characters except newline.
                    597: Escaping into octal is possible although non-portable:
                    598: .TS
                    599: center;
                    600: l.
                    601: [\e40\-\e176]
                    602: .TE
                    603: matches all printable characters in the ASCII character set, from octal
                    604: 40 (blank) to octal 176 (tilde).
                    605: .PP
                    606: .I
                    607: Optional expressions.
                    608: .R
                    609: The operator
                    610: .I ?
                    611: indicates
                    612: an optional element of an expression.
                    613: Thus
                    614: .TS
                    615: center;
                    616: l.
                    617: ab?c
                    618: .TE
                    619: matches either
                    620: .I ac
                    621: or
                    622: .I abc .
                    623: .PP
                    624: .I
                    625: Repeated expressions.
                    626: .R
                    627: Repetitions of classes are indicated by the operators
                    628: .I \(**
                    629: and
                    630: .I + .
                    631: .TS
                    632: center;
                    633: l.
                    634: \f2a\(**\f1
                    635: .TE
                    636: is any number of consecutive
                    637: .I a
                    638: characters, including zero; while
                    639: .TS
                    640: center;
                    641: l.
                    642: a+
                    643: .TE
                    644: is one or more instances of
                    645: .I a.
                    646: For example,
                    647: .TS
                    648: center;
                    649: l.
                    650: [a\-z]+
                    651: .TE
                    652: is all strings of lower case letters.
                    653: And
                    654: .TS
                    655: center;
                    656: l.
                    657: [A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(**
                    658: .TE
                    659: indicates all alphanumeric strings with a leading
                    660: alphabetic character.
                    661: This is a typical expression for recognizing identifiers in
                    662: computer languages.
                    663: .PP
                    664: .I
                    665: Alternation and Grouping.
                    666: .R
                    667: The operator |
                    668: indicates alternation:
                    669: .TS
                    670: center;
                    671: l.
                    672: (ab\||\|cd)
                    673: .TE
                    674: matches either
                    675: .ul
                    676: ab
                    677: or
                    678: .ul
                    679: cd.
                    680: Note that parentheses are used for grouping, although
                    681: they are
                    682: not necessary on the outside level;
                    683: .TS
                    684: center;
                    685: l.
                    686: ab\||\|cd
                    687: .TE
                    688: would have sufficed.
                    689: Parentheses
                    690: can be used for more complex expressions:
                    691: .TS
                    692: center;
                    693: l.
                    694: (ab\||\|cd+)?(ef)\(**
                    695: .TE
                    696: matches such strings as
                    697: .I abefef ,
                    698: .I efefef ,
                    699: .I cdef ,
                    700: or
                    701: .I cddd\| ;
                    702: but not
                    703: .I abc ,
                    704: .I abcd ,
                    705: or
                    706: .I abcdef .
                    707: .PP
                    708: .I
                    709: Context sensitivity.
                    710: .R
                    711: Lex will recognize a small amount of surrounding
                    712: context.  The two simplest operators for this are
                    713: .I ^
                    714: and
                    715: .I $ .
                    716: If the first character of an expression is
                    717: .I ^ ,
                    718: the expression will only be matched at the beginning
                    719: of a line (after a newline character, or at the beginning of
                    720: the input stream).
                    721: This can never conflict with the other meaning of
                    722: .I ^ ,
                    723: complementation
                    724: of character classes, since that only applies within
                    725: the [\|] operators.
                    726: If the very last character is
                    727: .I $ ,
                    728: the expression will only be matched at the end of a line (when
                    729: immediately followed by newline).
                    730: The latter operator is a special case of the
                    731: .I /
                    732: operator character,
                    733: which indicates trailing context.
                    734: The expression
                    735: .TS
                    736: center;
                    737: l.
                    738: ab/cd
                    739: .TE
                    740: matches the string
                    741: .I ab ,
                    742: but only if followed by
                    743: .ul
                    744: cd.
                    745: Thus
                    746: .TS
                    747: center;
                    748: l.
                    749: ab$
                    750: .TE
                    751: is the same as
                    752: .TS
                    753: center;
                    754: l.
                    755: ab/\en
                    756: .TE
                    757: Left context is handled in Lex by
                    758: .I
                    759: start conditions
                    760: .R
                    761: as explained in section 10.  If a rule is only to be executed
                    762: when the Lex automaton interpreter is in start condition
                    763: .I
                    764: x,
                    765: .R
                    766: the rule should be prefixed by
                    767: .TS
                    768: center;
                    769: l.
                    770: <x>
                    771: .TE
                    772: using the angle bracket operator characters.
                    773: If we considered ``being at the beginning of a line'' to be
                    774: start condition
                    775: .I ONE ,
                    776: then the ^ operator
                    777: would be equivalent to
                    778: .TS
                    779: center;
                    780: l.
                    781: <ONE>
                    782: .TE
                    783: Start conditions are explained more fully later.
                    784: .PP
                    785: .I
                    786: Repetitions and Definitions.
                    787: .R
                    788: The operators {} specify
                    789: either repetitions (if they enclose numbers)
                    790: or
                    791: definition expansion (if they enclose a name).  For example
                    792: .TS
                    793: center;
                    794: l.
                    795: {digit}
                    796: .TE
                    797: looks for a predefined string named
                    798: .I digit
                    799: and inserts it
                    800: at that point in the expression.
                    801: The definitions are given in the first part of the Lex
                    802: input, before the rules.
                    803: In contrast,
                    804: .TS
                    805: center;
                    806: l.
                    807: a{1,5}
                    808: .TE
                    809: looks for 1 to 5 occurrences of
                    810: .I a .
                    811: .PP
                    812: Finally, initial
                    813: .I %
                    814: is special, being the separator
                    815: for Lex source segments.
                    816: .2C
                    817: .NH
                    818: Lex Actions.
                    819: .PP
                    820: When an expression written as above is matched, Lex
                    821: executes the corresponding action.  This section describes
                    822: some features of Lex which aid in writing actions.  Note
                    823: that there is a default action, which
                    824: consists of copying the input to the output.  This
                    825: is performed on all strings not otherwise matched.  Thus
                    826: the Lex user who wishes to absorb the entire input, without
                    827: producing any output, must provide rules to match everything.
                    828: When Lex is being used with Yacc, this is the normal
                    829: situation.
                    830: One may consider that actions are what is done instead of
                    831: copying the input to the output; thus, in general,
                    832: a rule which merely copies can be omitted.
                    833: Also, a character combination
                    834: which is omitted from the rules
                    835: and which appears as input
                    836: is likely to be printed on the output, thus calling
                    837: attention to the gap in the rules.
                    838: .PP
                    839: One of the simplest things that can be done is to ignore
                    840: the input.   Specifying a C null statement, \fI;\fR as an action
                    841: causes this result.  A frequent rule is
                    842: .TS
                    843: center;
                    844: l l.
                    845: [ \et\en]      ;
                    846: .TE
                    847: which causes the three spacing characters (blank, tab, and newline)
                    848: to be ignored.
                    849: .PP
                    850: Another easy way to avoid writing actions is the action character
                    851: |, which indicates that the action for this rule is the action
                    852: for the next rule.
                    853: The previous example could also have been written
                    854: .TS
                    855: center;
                    856: l l.
                    857: " "            |
                    858: "\et"          |
                    859: "\en"          ;
                    860: .TE
                    861: with the same result, although in different style.
                    862: The quotes around \en and \et are not required.
                    863: .PP
                    864: In more complex actions, the user
                    865: will
                    866: often want to know the actual text that matched some expression
                    867: like
                    868: .I [a\(miz]+ .
                    869: Lex leaves this text in an external character
                    870: array named
                    871: .I
                    872: yytext.
                    873: .R
                    874: Thus, to print the name found,
                    875: a rule like
                    876: .TS
                    877: center;
                    878: l l.
                    879: [a\-z]+        printf("%s", yytext);
                    880: .TE
                    881: will print
                    882: the string in
                    883: .I
                    884: yytext.
                    885: .R
                    886: The C function
                    887: .I
                    888: printf
                    889: .R
                    890: accepts a format argument and data to be printed;
                    891: in this case, the format is ``print string'' (% indicating
                    892: data conversion, and
                    893: .I s
                    894: indicating string type),
                    895: and the data are the characters
                    896: in
                    897: .I
                    898: yytext.
                    899: .R
                    900: So this just places
                    901: the matched string
                    902: on the output.
                    903: This action
                    904: is so common that
                    905: it may be written as ECHO:
                    906: .TS
                    907: center;
                    908: l l.
                    909: [a\-z]+        ECHO;
                    910: .TE
                    911: is the same as the above.
                    912: Since the default action is just to
                    913: print the characters found, one might ask why
                    914: give a rule, like this one, which merely specifies
                    915: the default action?
                    916: Such rules are often required
                    917: to avoid matching some other rule
                    918: which is not desired.  For example, if there is a rule
                    919: which matches
                    920: .I read
                    921: it will normally match the instances of
                    922: .I read
                    923: contained in
                    924: .I bread
                    925: or
                    926: .I readjust ;
                    927: to avoid
                    928: this,
                    929: a rule
                    930: of the form
                    931: .I [a\(miz]+
                    932: is needed.
                    933: This is explained further below.
                    934: .PP
                    935: Sometimes it is more convenient to know the end of what
                    936: has been found; hence Lex also provides a count
                    937: .I
                    938: yyleng
                    939: .R
                    940: of the number of characters matched.
                    941: To count both the number
                    942: of words and the number of characters in words in the input, the user might write
                    943: .TS
                    944: center;
                    945: l l.
                    946: [a\-zA\-Z]+    {words++; chars += yyleng;}
                    947: .TE
                    948: which accumulates in
                    949: .ul
                    950: chars
                    951: the number
                    952: of characters in the words recognized.
                    953: The last character in the string matched can
                    954: be accessed by
                    955: .TS
                    956: center;
                    957: l.
                    958: yytext[yyleng\-1]
                    959: .TE
                    960: .PP
                    961: Occasionally, a Lex
                    962: action may decide that a rule has not recognized the correct
                    963: span of characters.
                    964: Two routines are provided to aid with this situation.
                    965: First,
                    966: .I
                    967: yymore()
                    968: .R
                    969: can be called to indicate that the next input expression recognized is to be
                    970: tacked on to the end of this input.  Normally,
                    971: the next input string would overwrite the current
                    972: entry in
                    973: .I
                    974: yytext.
                    975: .R
                    976: Second,
                    977: .I
                    978: yyless (n)
                    979: .R
                    980: may be called to indicate that not all the characters matched
                    981: by the currently successful expression are wanted right now.
                    982: The argument
                    983: .I
                    984: n
                    985: .R
                    986: indicates the number of characters
                    987: in
                    988: .I
                    989: yytext
                    990: .R
                    991: to be retained.
                    992: Further characters previously matched
                    993: are
                    994: returned to the input.  This provides the same sort of
                    995: look~ahead offered by the / operator,
                    996: but in a different form.
                    997: .PP
                    998: .I
                    999: Example:
                   1000: .R
                   1001: Consider a language which defines
                   1002: a string as a set of characters between quotation (") marks, and provides that
                   1003: to include a " in a string it must be preceded by a \e.  The
                   1004: regular expression which matches that is somewhat confusing,
                   1005: so that it might be preferable to write
                   1006: .TS
                   1007: center;
                   1008: l l.
                   1009: \e"[^"]\(**    {
                   1010:        if (yytext[yyleng\-1] == \(fm\e\e\(fm)
                   1011:             yymore();
                   1012:        else
                   1013:             ... normal user processing
                   1014:        }
                   1015: .TE
                   1016: which will, when faced with a string such as
                   1017: .I
                   1018: "abc\e"def\|"
                   1019: .R
                   1020: first match
                   1021: the five characters
                   1022: \fI"abc\e\|\fR;
                   1023: then
                   1024: the call to
                   1025: .I yymore()
                   1026: will
                   1027: cause the next part of the string,
                   1028: \fI"def\|\fR,
                   1029: to be tacked on the end.
                   1030: Note that the final quote terminating the string should be picked
                   1031: up in the code labeled ``normal processing''.
                   1032: .PP
                   1033: The function
                   1034: .I
                   1035: yyless()
                   1036: .R
                   1037: might be used to reprocess
                   1038: text in various circumstances.  Consider the C problem of distinguishing
                   1039: the ambiguity of ``=\(mia''.
                   1040: Suppose it is desired to treat this as ``=\(mi a''
                   1041: but print a message.  A rule might be
                   1042: .ps 9
                   1043: .vs 11
                   1044: .TS
                   1045: center;
                   1046: l l.
                   1047: =\(mi[a\-zA\-Z]        {
                   1048:        printf("Op (=\(mi) ambiguous\en");
                   1049:        yyless(yyleng\-1);
                   1050:        ... action for =\(mi ...
                   1051:        }
                   1052: .TE
                   1053: .ps 10
                   1054: .vs 12
                   1055: which prints a message, returns the letter after the
                   1056: operator to the input stream, and treats the operator as ``=\(mi''.
                   1057: Alternatively it might be desired to treat this as ``=  \(mia''.
                   1058: To do this, just return the minus
                   1059: sign as well as the letter to the input:
                   1060: .ps 9
                   1061: .vs 11
                   1062: .TS
                   1063: center;
                   1064: l l.
                   1065: =\(mi[a\-zA\-Z]        {
                   1066:        printf("Op (=\(mi) ambiguous\en");
                   1067:        yyless(yyleng\-2);
                   1068:        ... action for = ...
                   1069:        }
                   1070: .TE
                   1071: .ps 10
                   1072: .vs 12
                   1073: will perform the other interpretation.
                   1074: Note that the expressions for the two cases might more easily
                   1075: be written
                   1076: .TS
                   1077: center;
                   1078: l l.
                   1079: =\(mi/[A\-Za\-z]
                   1080: .TE
                   1081: in the first case and
                   1082: .TS
                   1083: center;
                   1084: l.
                   1085: =/\-[A\-Za\-z]
                   1086: .TE
                   1087: in the second;
                   1088: no backup would be required in the rule action.
                   1089: It is not necessary to recognize the whole identifier
                   1090: to observe the ambiguity.
                   1091: The
                   1092: possibility of ``=\(mi3'', however, makes
                   1093: .TS
                   1094: center;
                   1095: l.
                   1096: =\(mi/[^ \et\en]
                   1097: .TE
                   1098: a still better rule.
                   1099: .PP
                   1100: In addition to these routines, Lex also permits
                   1101: access to the I/O routines
                   1102: it uses.
                   1103: They are:
                   1104: .IP 1)
                   1105: .I
                   1106: input()
                   1107: .R
                   1108: which returns the next input character;
                   1109: .IP 2)
                   1110: .I
                   1111: output(c)
                   1112: .R
                   1113: which writes the character
                   1114: .I
                   1115: c
                   1116: .R
                   1117: on the output; and
                   1118: .IP 3)
                   1119: .I
                   1120: unput(c)
                   1121: .R
                   1122: pushes the character
                   1123: .I
                   1124: c
                   1125: .R
                   1126: back onto the input stream to be read later by
                   1127: .I
                   1128: input().
                   1129: .R
                   1130: .LP
                   1131: By default these routines are provided as macro definitions,
                   1132: but the user can override them and supply private versions.
                   1133: These routines
                   1134: define the relationship between external files and
                   1135: internal characters, and must all be retained
                   1136: or modified consistently.
                   1137: They may be redefined, to
                   1138: cause input or output to be transmitted to or from strange
                   1139: places, including other programs or internal memory;
                   1140: but the character set used must be consistent in all routines;
                   1141: a value of zero returned by
                   1142: .I
                   1143: input
                   1144: .R
                   1145: must mean end of file; and
                   1146: the relationship between
                   1147: .I
                   1148: unput
                   1149: .R
                   1150: and
                   1151: .I
                   1152: input
                   1153: .R
                   1154: must be retained
                   1155: or the Lex look~ahead will not work.
                   1156: Lex does not look ahead at all if it does not have to,
                   1157: but every rule ending in
                   1158: .ft I
                   1159: + \(** ?
                   1160: .ft R
                   1161: or
                   1162: .ft I
                   1163: $
                   1164: .ft R
                   1165: or containing
                   1166: .ft I
                   1167: /
                   1168: .ft R
                   1169: implies look~ahead.
                   1170: Look~ahead is also necessary to match an expression that is a prefix
                   1171: of another expression.
                   1172: See below for a discussion of the character set used by Lex.
                   1173: The standard Lex library imposes
                   1174: a 100 character limit on backup.
                   1175: .PP
                   1176: Another Lex library routine that the user will sometimes want
                   1177: to redefine is
                   1178: .I
                   1179: yywrap()
                   1180: .R
                   1181: which is called whenever Lex reaches an end-of-file.
                   1182: If
                   1183: .I
                   1184: yywrap
                   1185: .R
                   1186: returns a 1, Lex continues with the normal wrapup on end of input.
                   1187: Sometimes, however, it is convenient to arrange for more
                   1188: input to arrive
                   1189: from a new source.
                   1190: In this case, the user should provide
                   1191: a
                   1192: .I
                   1193: yywrap
                   1194: .R
                   1195: which
                   1196: arranges for new input and
                   1197: returns 0.  This instructs Lex to continue processing.
                   1198: The default
                   1199: .I
                   1200: yywrap
                   1201: .R
                   1202: always returns 1.
                   1203: .PP
                   1204: This routine is also a convenient place
                   1205: to print tables, summaries, etc. at the end
                   1206: of a program.  Note that it is not
                   1207: possible to write a normal rule which recognizes
                   1208: end-of-file; the only access to this condition is
                   1209: through
                   1210: .I
                   1211: yywrap.
                   1212: .R
                   1213: In fact, unless a private version of
                   1214: .I
                   1215: input()
                   1216: .R
                   1217: is supplied
                   1218: a file containing nulls
                   1219: cannot be handled,
                   1220: since a value of 0 returned by
                   1221: .I
                   1222: input
                   1223: .R
                   1224: is taken to be end-of-file.
                   1225: .PP
                   1226: .2C
                   1227: .NH
                   1228: Ambiguous Source Rules.
                   1229: .PP
                   1230: Lex can handle ambiguous specifications.
                   1231: When more than one expression can match the
                   1232: current input, Lex chooses as follows:
                   1233: .IP 1)
                   1234: The longest match is preferred.
                   1235: .IP 2)
                   1236: Among rules which matched the same number of characters,
                   1237: the rule given first is preferred.
                   1238: .LP
                   1239: Thus, suppose the rules
                   1240: .TS
                   1241: center;
                   1242: l l.
                   1243: integer        keyword action ...;
                   1244: [a\-z]+        identifier action ...;
                   1245: .TE
                   1246: to be given in that order.  If the input is
                   1247: .I integers ,
                   1248: it is taken as an identifier, because
                   1249: .I [a\-z]+
                   1250: matches 8 characters while
                   1251: .I integer
                   1252: matches only 7.
                   1253: If the input is
                   1254: .I integer ,
                   1255: both rules match 7 characters, and
                   1256: the keyword rule is selected because it was given first.
                   1257: Anything shorter (e.g. \fIint\fR\|) will
                   1258: not match the expression
                   1259: .I integer
                   1260: and so the identifier interpretation is used.
                   1261: .PP
                   1262: The principle of preferring the longest
                   1263: match makes rules containing
                   1264: expressions like
                   1265: .I \&.\(**
                   1266: dangerous.
                   1267: For example,
                   1268: .TS
                   1269: center;
                   1270: l.
                   1271: \&\(fm.\(**\(fm
                   1272: .TE
                   1273: might seem a good way of recognizing
                   1274: a string in single quotes.
                   1275: But it is an invitation for the program to read far
                   1276: ahead, looking for a distant
                   1277: single quote.
                   1278: Presented with the input
                   1279: .TS
                   1280: center;
                   1281: l l.
                   1282: \&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here
                   1283: .TE
                   1284: the above expression will match
                   1285: .TS
                   1286: center;
                   1287: l l.
                   1288: \&\(fmfirst\(fm quoted string here, \(fmsecond\(fm
                   1289: .TE
                   1290: which is probably not what was wanted.
                   1291: A better rule is of the form
                   1292: .TS
                   1293: center;
                   1294: l.
                   1295: \&\(fm[^\(fm\en]\(**\(fm
                   1296: .TE
                   1297: which, on the above input, will stop
                   1298: after
                   1299: .I \(fmfirst\(fm .
                   1300: The consequences
                   1301: of errors like this are mitigated by the fact
                   1302: that the
                   1303: .I \&.
                   1304: operator will not match newline.
                   1305: Thus expressions like
                   1306: .I \&.\(**
                   1307: stop on the
                   1308: current line.
                   1309: Don't try to defeat this with expressions like
                   1310: .I (.|\en)+
                   1311: or
                   1312: equivalents;
                   1313: the Lex generated program will try to read
                   1314: the entire input file, causing
                   1315: internal buffer overflows.
                   1316: .PP
                   1317: Note that Lex is normally partitioning
                   1318: the input stream, not searching for all possible matches
                   1319: of each expression.
                   1320: This means that each character is accounted for
                   1321: once and only once.
                   1322: For example, suppose it is desired to
                   1323: count occurrences of both \fIshe\fR and \fIhe\fR in an input text.
                   1324: Some Lex rules to do this might be
                   1325: .TS
                   1326: center;
                   1327: l l.
                   1328: she    s++;
                   1329: he     h++;
                   1330: \en    |
                   1331: \&.    ;
                   1332: .TE
                   1333: where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR.
                   1334: Remember that . does not include newline.
                   1335: Since \fIshe\fR includes \fIhe\fR, Lex will normally
                   1336: .I
                   1337: not
                   1338: .R
                   1339: recognize
                   1340: the instances of \fIhe\fR included in \fIshe\fR,
                   1341: since once it has passed a \fIshe\fR those characters are gone.
                   1342: .PP
                   1343: Sometimes the user would like to override this choice.  The action
                   1344: REJECT
                   1345: means ``go do the next alternative.''
                   1346: It causes whatever rule was second choice after the current
                   1347: rule to be executed.
                   1348: The position of the input pointer is adjusted accordingly.
                   1349: Suppose the user really wants to count the included instances of \fIhe\fR:
                   1350: .TS
                   1351: center;
                   1352: l l.
                   1353: she    {s++; REJECT;}
                   1354: he     {h++; REJECT;}
                   1355: \en    |
                   1356: \&.    ;
                   1357: .TE
                   1358: these rules are one way of changing the previous example
                   1359: to do just that.
                   1360: After counting each expression, it is rejected; whenever appropriate,
                   1361: the other expression will then be counted.  In this example, of course,
                   1362: the user could note that \fIshe\fR includes \fIhe\fR but not
                   1363: vice versa, and omit the REJECT action on \fIhe\fR;
                   1364: in other cases, however, it
                   1365: would not be possible a priori to tell
                   1366: which input characters
                   1367: were in both classes.
                   1368: .PP
                   1369: Consider the two rules
                   1370: .TS
                   1371: center;
                   1372: l l.
                   1373: a[bc]+ { ... ; REJECT;}
                   1374: a[cd]+ { ... ; REJECT;}
                   1375: .TE
                   1376: If the input is
                   1377: .I ab ,
                   1378: only the first rule matches,
                   1379: and on
                   1380: .I ad
                   1381: only the second matches.
                   1382: The input string
                   1383: .I accb
                   1384: matches the first rule for four characters
                   1385: and then the second rule for three characters.
                   1386: In contrast, the input
                   1387: .I accd
                   1388: agrees with
                   1389: the second rule for four characters and then the first
                   1390: rule for three.
                   1391: .PP
                   1392: In general, REJECT is useful whenever
                   1393: the purpose of Lex is not to partition the input
                   1394: stream but to detect all examples of some items
                   1395: in the input, and the instances of these items
                   1396: may overlap or include each other.
                   1397: Suppose a digram table of the input is desired;
                   1398: normally the digrams overlap, that is the word
                   1399: .I the
                   1400: is considered to contain
                   1401: both
                   1402: .I th
                   1403: and
                   1404: .I he .
                   1405: Assuming a two-dimensional array named
                   1406: .ul
                   1407: digram
                   1408: to be incremented, the appropriate
                   1409: source is
                   1410: .TS
                   1411: center;
                   1412: l l.
                   1413: %%
                   1414: [a\-z][a\-z]   {
                   1415:        digram[yytext[0]][yytext[1]]++;
                   1416:        REJECT;
                   1417:        }
                   1418: \.     ;
                   1419: \en    ;
                   1420: .TE
                   1421: where the REJECT is necessary to pick up
                   1422: a letter pair beginning at every character, rather than at every
                   1423: other character.
                   1424: .2C
                   1425: .NH
                   1426: Lex Source Definitions.
                   1427: .PP
                   1428: Remember the format of the Lex
                   1429: source:
                   1430: .TS
                   1431: center;
                   1432: l.
                   1433: {definitions}
                   1434: %%
                   1435: {rules}
                   1436: %%
                   1437: {user routines}
                   1438: .TE
                   1439: So far only the rules have been described.  The user needs
                   1440: additional options,
                   1441: though, to define variables for use in his program and for use
                   1442: by Lex.
                   1443: These can go either in the definitions section
                   1444: or in the rules section.
                   1445: .PP
                   1446: Remember that Lex is turning the rules into a program.
                   1447: Any source not intercepted by Lex is copied
                   1448: into the generated program.  There are three classes
                   1449: of such things.
                   1450: .IP 1)
                   1451: Any line which is not part of a Lex rule or action
                   1452: which begins with a blank or tab is copied into
                   1453: the Lex generated program.
                   1454: Such source input prior to the first %% delimiter will be external
                   1455: to any function in the code; if it appears immediately after the first
                   1456: %%,
                   1457: it appears in an appropriate place for declarations
                   1458: in the function written by Lex which contains the actions.
                   1459: This material must look like program fragments,
                   1460: and should precede the first Lex rule.
                   1461: .IP
                   1462: As a side effect of the above, lines which begin with a blank
                   1463: or tab, and which contain a comment,
                   1464: are passed through to the generated program.
                   1465: This can be used to include comments in either the Lex source or
                   1466: the generated code.  The comments should follow the host
                   1467: language convention.
                   1468: .IP 2)
                   1469: Anything included between lines containing
                   1470: only
                   1471: .I %{
                   1472: and
                   1473: .I %}
                   1474: is
                   1475: copied out as above.  The delimiters are discarded.
                   1476: This format permits entering text like preprocessor statements that
                   1477: must begin in column 1,
                   1478: or copying lines that do not look like programs.
                   1479: .IP 3)
                   1480: Anything after the third %% delimiter, regardless of formats, etc.,
                   1481: is copied out after the Lex output.
                   1482: .PP
                   1483: Definitions intended for Lex are given
                   1484: before the first %% delimiter.  Any line in this section
                   1485: not contained between %{ and %}, and begining
                   1486: in column 1, is assumed to define Lex substitution strings.
                   1487: The format of such lines is
                   1488: .TS
                   1489: center;
                   1490: l l.
                   1491: name translation
                   1492: .TE
                   1493: and it
                   1494: causes the string given as a translation to
                   1495: be associated with the name.
                   1496: The name and translation
                   1497: must be separated by at least one blank or tab, and the name must begin with a letter.
                   1498: The translation can then be called out
                   1499: by the {name} syntax in a rule.
                   1500: Using {D} for the digits and {E} for an exponent field,
                   1501: for example, might abbreviate rules to recognize numbers:
                   1502: .TS
                   1503: center;
                   1504: l l.
                   1505: D      [0\-9]
                   1506: E      [DEde][\-+]?{D}+
                   1507: %%
                   1508: {D}+   printf("integer");
                   1509: {D}+"."{D}\(**({E})?   |
                   1510: {D}\(**"."{D}+({E})?   |
                   1511: {D}+{E}                printf("real");
                   1512: .TE
                   1513: Note the first two rules for real numbers;
                   1514: both require a decimal point and contain
                   1515: an optional exponent field,
                   1516: but the first requires at least one digit before the
                   1517: decimal point and the second requires at least one
                   1518: digit after the decimal point.
                   1519: To correctly handle the problem
                   1520: posed by a Fortran expression such as
                   1521: .I 35.EQ.I ,
                   1522: which does not contain a real number, a context-sensitive
                   1523: rule such as
                   1524: .TS
                   1525: center;
                   1526: l l.
                   1527: [0\-9]+/"."EQ  printf("integer");
                   1528: .TE
                   1529: could be used in addition to the normal rule for integers.
                   1530: .PP
                   1531: The definitions
                   1532: section may also contain other commands, including the
                   1533: selection of a host language, a character set table,
                   1534: a list of start conditions, or adjustments to the default
                   1535: size of arrays within Lex itself for larger source programs.
                   1536: These possibilities
                   1537: are discussed below under ``Summary of Source Format,''
                   1538: section 12.
                   1539: .2C
                   1540: .NH
                   1541: Usage.
                   1542: .PP
                   1543: There are two steps in
                   1544: compiling a Lex source program.
                   1545: First, the Lex source must be turned into a generated program
                   1546: in the host general purpose language.
                   1547: Then this program must be compiled and loaded, usually with
                   1548: a library of Lex subroutines.
                   1549: The generated program
                   1550: is on a file named
                   1551: .I lex.yy.c .
                   1552: The I/O library is defined in terms of the C standard
                   1553: library [6].
                   1554: .PP
                   1555: The C programs generated by Lex are slightly different
                   1556: on OS/370, because the
                   1557: OS compiler is less powerful than the UNIX or GCOS compilers,
                   1558: and does less at compile time.
                   1559: C programs generated on GCOS and UNIX are the same.
                   1560: .PP
                   1561: .I
                   1562: UNIX.
                   1563: .R
                   1564: The library is accessed by the loader flag
                   1565: .I \-ll .
                   1566: So an appropriate
                   1567: set of commands is
                   1568: .KS
                   1569: .in 5
                   1570: lex source
                   1571: cc lex.yy.c \-ll
                   1572: .in 0
                   1573: .KE
                   1574: The resulting program is placed on the usual file
                   1575: .I
                   1576: a.out
                   1577: .R
                   1578: for later execution.
                   1579: To use Lex with Yacc see below.
                   1580: Although the default Lex I/O routines use the C standard library,
                   1581: the Lex automata themselves do not do so;
                   1582: if private versions of
                   1583: .I
                   1584: input,
                   1585: output
                   1586: .R
                   1587: and
                   1588: .I unput
                   1589: are given, the library can be avoided.
                   1590: .PP
                   1591: .2C
                   1592: .NH
                   1593: Lex and Yacc.
                   1594: .PP
                   1595: If you want to use Lex with Yacc, note that what Lex writes is a program
                   1596: named
                   1597: .I
                   1598: yylex(),
                   1599: .R
                   1600: the name required by Yacc for its analyzer.
                   1601: Normally, the default main program on the Lex library
                   1602: calls this routine, but if Yacc is loaded, and its main
                   1603: program is used, Yacc will call
                   1604: .I
                   1605: yylex().
                   1606: .R
                   1607: In this case each Lex rule should end with
                   1608: .TS
                   1609: center;
                   1610: l.
                   1611: return(token);
                   1612: .TE
                   1613: where the appropriate token value is returned.
                   1614: An easy way to get access
                   1615: to Yacc's names for tokens is to
                   1616: compile the Lex output file as part of
                   1617: the Yacc output file by placing the line
                   1618: .TS
                   1619: center;
                   1620: l.
                   1621: # include "lex.yy.c"
                   1622: .TE
                   1623: in the last section of Yacc input.
                   1624: Supposing the grammar to be
                   1625: named ``good'' and the lexical rules to be named ``better''
                   1626: the UNIX command sequence can just be:
                   1627: .TS
                   1628: center;
                   1629: l.
                   1630: yacc good
                   1631: lex better
                   1632: cc y.tab.c \-ly \-ll
                   1633: .TE
                   1634: The Yacc library (\-ly) should be loaded before the Lex library,
                   1635: to obtain a main program which invokes the Yacc parser.
                   1636: The generations of Lex and Yacc programs can be done in
                   1637: either order.
                   1638: .2C
                   1639: .NH
                   1640: Examples.
                   1641: .PP
                   1642: As a trivial problem, consider copying an input file while
                   1643: adding 3 to every positive number divisible by 7.
                   1644: Here is a suitable Lex source program
                   1645: .TS
                   1646: center;
                   1647: l l.
                   1648: %%
                   1649:        int k;
                   1650: [0\-9]+        {
                   1651:        k = atoi(yytext);
                   1652:        if (k%7 == 0)
                   1653:             printf("%d", k+3);
                   1654:        else
                   1655:             printf("%d",k);
                   1656:        }
                   1657: .TE
                   1658: to do just that.
                   1659: The rule [0\-9]+ recognizes strings of digits;
                   1660: .I
                   1661: atoi
                   1662: .R
                   1663: converts the digits to binary
                   1664: and stores the result in
                   1665: .ul
                   1666: k.
                   1667: The operator % (remainder) is used to check whether
                   1668: .ul
                   1669: k
                   1670: is divisible by 7; if it is,
                   1671: it is incremented by 3 as it is written out.
                   1672: It may be objected that this program will alter such
                   1673: input items as
                   1674: .I 49.63
                   1675: or
                   1676: .I X7 .
                   1677: Furthermore, it increments the absolute value
                   1678: of all negative numbers divisible by 7.
                   1679: To avoid this, just add a few more rules after the active one,
                   1680: as here:
                   1681: .TS
                   1682: center;
                   1683: l l.
                   1684: %%
                   1685:        int k;
                   1686: \-?[0\-9]+     {
                   1687:        k = atoi(yytext);
                   1688:        printf("%d",
                   1689:          k%7 == 0 ? k+3 : k);
                   1690:        }
                   1691: \-?[0\-9.]+    ECHO;
                   1692: [A-Za-z][A-Za-z0-9]+   ECHO;
                   1693: .TE
                   1694: Numerical strings containing
                   1695: a ``.'' or preceded by a letter will be picked up by
                   1696: one of the last two rules, and not changed.
                   1697: The
                   1698: .I if\-else
                   1699: has been replaced by
                   1700: a C conditional expression to save space;
                   1701: the form
                   1702: .ul
                   1703: a?b:c
                   1704: means ``if
                   1705: .I a
                   1706: then
                   1707: .I b
                   1708: else
                   1709: .I c ''.
                   1710: .PP
                   1711: For an example of statistics gathering, here
                   1712: is a program which histograms the lengths
                   1713: of words, where a word is defined as a string of letters.
                   1714: .TS
                   1715: center;
                   1716: l l.
                   1717:        int lengs[100];
                   1718: %%
                   1719: [a\-z]+        lengs[yyleng]++;
                   1720: \&.    |
                   1721: \en    ;
                   1722: %%
                   1723: .T&
                   1724: l s.
                   1725: yywrap()
                   1726: {
                   1727: int i;
                   1728: printf("Length  No. words\en");
                   1729: for(i=0; i<100; i++)
                   1730:      if (lengs[i] > 0)
                   1731:           printf("%5d%10d\en",i,lengs[i]);
                   1732: return(1);
                   1733: }
                   1734: .TE
                   1735: This program
                   1736: accumulates the histogram, while producing no output.  At the end
                   1737: of the input it prints the table.
                   1738: The final statement
                   1739: .I
                   1740: return(1);
                   1741: .R
                   1742: indicates that Lex is to perform wrapup.  If
                   1743: .I
                   1744: yywrap
                   1745: .R
                   1746: returns zero (false)
                   1747: it implies that further input is available
                   1748: and the program is
                   1749: to continue reading and processing.
                   1750: To provide a
                   1751: .I
                   1752: yywrap
                   1753: .R
                   1754: that never
                   1755: returns true causes an infinite loop.
                   1756: .PP
                   1757: As a larger example,
                   1758: here are some parts of a program written by N. L. Schryer
                   1759: to convert double precision Fortran to single precision Fortran.
                   1760: Because Fortran does not distinguish upper and lower case letters,
                   1761: this routine begins by defining a set of classes including
                   1762: both cases of each letter:
                   1763: .TS
                   1764: center;
                   1765: l l.
                   1766: a      [aA]
                   1767: b      [bB]
                   1768: c      [cC]
                   1769: \&...
                   1770: z      [zZ]
                   1771: .TE
                   1772: An additional class recognizes white space:
                   1773: .TS
                   1774: center;
                   1775: l l.
                   1776: W      [ \et]\(**
                   1777: .TE
                   1778: The first rule changes
                   1779: ``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.
                   1780: .TS
                   1781: center;
                   1782: l.
                   1783: {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
                   1784:      printf(yytext[0]==\(fmd\(fm? "real" : "REAL");
                   1785:      }
                   1786: .TE
                   1787: Care is taken throughout this program to preserve the case
                   1788: (upper or lower)
                   1789: of the original program.
                   1790: The conditional operator is used to
                   1791: select the proper form of the keyword.
                   1792: The next rule copies continuation card indications to
                   1793: avoid confusing them with constants:
                   1794: .TS
                   1795: center;
                   1796: l l.
                   1797: ^"     "[^ 0]  ECHO;
                   1798: .TE
                   1799: In the regular expression, the quotes surround the
                   1800: blanks.
                   1801: It is interpreted as
                   1802: ``beginning of line, then five blanks, then
                   1803: anything but blank or zero.'' 
                   1804: Note the two different meanings of
                   1805: .I ^ .
                   1806: There follow some rules to change double precision
                   1807: constants to ordinary floating constants.
                   1808: .TS
                   1809: center;
                   1810: l.
                   1811: [0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+     |
                   1812: [0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+     |
                   1813: "."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+     {
                   1814:      /\(** convert constants \(**/
                   1815:      for(p=yytext; \(**p != 0; p++)
                   1816:           {
                   1817:           if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm)
                   1818:                \(**p=+ \(fme\(fm\- \(fmd\(fm;
                   1819:           ECHO;
                   1820:           }
                   1821: .TE
                   1822: After the floating point constant is recognized, it is
                   1823: scanned by the
                   1824: .ul
                   1825: for
                   1826: loop
                   1827: to find the letter
                   1828: .I d
                   1829: or
                   1830: .I D .
                   1831: The program than adds
                   1832: .c
                   1833: .I \(fme\(fm\-\(fmd\(fm ,
                   1834: which converts
                   1835: it to the next letter of the alphabet.
                   1836: The modified constant, now single-precision,
                   1837: is written out again.
                   1838: There follow a series of names which must be respelled to remove
                   1839: their initial \fId\fR.
                   1840: By using the
                   1841: array
                   1842: .I
                   1843: yytext
                   1844: .R
                   1845: the same action suffices for all the names (only a sample of
                   1846: a rather long list is given here).
                   1847: .TS
                   1848: center;
                   1849: l l.
                   1850: {d}{s}{i}{n}   |
                   1851: {d}{c}{o}{s}   |
                   1852: {d}{s}{q}{r}{t}        |
                   1853: {d}{a}{t}{a}{n}        |
                   1854: \&...
                   1855: {d}{f}{l}{o}{a}{t}     printf("%s",yytext+1);
                   1856: .TE
                   1857: Another list of names must have initial \fId\fR changed to initial \fIa\fR:
                   1858: .TS
                   1859: center;
                   1860: l l.
                   1861: {d}{l}{o}{g}   |
                   1862: {d}{l}{o}{g}10 |
                   1863: {d}{m}{i}{n}1  |
                   1864: {d}{m}{a}{x}1  {
                   1865:        yytext[0] =+ \(fma\(fm \- \(fmd\(fm;
                   1866:        ECHO;
                   1867:        }
                   1868: .TE
                   1869: And one routine
                   1870: must have initial \fId\fR changed to initial \fIr\fR:
                   1871: .TS
                   1872: center;
                   1873: l l.
                   1874: {d}1{m}{a}{c}{h}       {yytext[0] =+ \(fmr\(fm  \- \(fmd\(fm;
                   1875:                ECHO;
                   1876:                }
                   1877: .TE
                   1878: To avoid such names as \fIdsinx\fR being detected as instances
                   1879: of \fIdsin\fR, some final rules pick up longer words as identifiers
                   1880: and copy some surviving characters:
                   1881: .TS
                   1882: center;
                   1883: l l.
                   1884: [A\-Za\-z][A\-Za\-z0\-9]\(**   |
                   1885: [0\-9]+        |
                   1886: \en    |
                   1887: \&.    ECHO;
                   1888: .TE
                   1889: Note that this program is not complete; it
                   1890: does not deal with the spacing problems in Fortran or
                   1891: with the use of keywords as identifiers.
                   1892: .br
                   1893: .2C
                   1894: .NH
                   1895: Left Context Sensitivity.
                   1896: .PP
                   1897: Sometimes
                   1898: it is desirable to have several sets of lexical rules
                   1899: to be applied at different times in the input.
                   1900: For example, a compiler preprocessor might distinguish
                   1901: preprocessor statements and analyze them differently
                   1902: from ordinary statements.
                   1903: This requires
                   1904: sensitivity
                   1905: to prior context, and there are several ways of handling
                   1906: such problems.
                   1907: The \fI^\fR operator, for example, is a prior context operator,
                   1908: recognizing immediately preceding left context just as \fI$\fR recognizes
                   1909: immediately following right context.
                   1910: Adjacent left context could be extended, to produce a facility similar to
                   1911: that for adjacent right context, but it is unlikely
                   1912: to be as useful, since often the relevant left context
                   1913: appeared some time earlier, such as at the beginning of a line.
                   1914: .PP
                   1915: This section describes three means of dealing
                   1916: with different environments: a simple use of flags,
                   1917: when only a few rules change from one environment to another,
                   1918: the use of
                   1919: .I
                   1920: start conditions
                   1921: .R
                   1922: on rules,
                   1923: and the possibility of making multiple lexical analyzers all run
                   1924: together.
                   1925: In each case, there are rules which recognize the need to change the
                   1926: environment in which the
                   1927: following input text is analyzed, and set some parameter
                   1928: to reflect the change.  This may be a flag explicitly tested by
                   1929: the user's action code; such a flag is the simplest way of dealing
                   1930: with the problem, since Lex is not involved at all.
                   1931: It may be more convenient,
                   1932: however,
                   1933: to have Lex remember the flags as initial conditions on the rules.
                   1934: Any rule may be associated with a start condition.  It will only
                   1935: be recognized when Lex is in
                   1936: that start condition.
                   1937: The current start condition may be changed at any time.
                   1938: Finally, if the sets of rules for the different environments
                   1939: are very dissimilar,
                   1940: clarity may be best achieved by writing several distinct lexical
                   1941: analyzers, and switching from one to another as desired.
                   1942: .PP
                   1943: Consider the following problem: copy the input to the output,
                   1944: changing the word \fImagic\fR to \fIfirst\fR on every line which began
                   1945: with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line
                   1946: which began with the letter \fIb\fR, and changing
                   1947: \fImagic\fR to \fIthird\fR on every line which began
                   1948: with the letter \fIc\fR.  All other words and all other lines
                   1949: are left unchanged.
                   1950: .PP
                   1951: These rules are so simple that the easiest way
                   1952: to do this job is with a flag:
                   1953: .TS
                   1954: center;
                   1955: l l.
                   1956:        int flag;
                   1957: %%
                   1958: ^a     {flag = \(fma\(fm; ECHO;}
                   1959: ^b     {flag = \(fmb\(fm; ECHO;}
                   1960: ^c     {flag = \(fmc\(fm; ECHO;}
                   1961: \en    {flag =  0 ; ECHO;}
                   1962: magic  {
                   1963:        switch (flag)
                   1964:        {
                   1965:        case \(fma\(fm: printf("first"); break;
                   1966:        case \(fmb\(fm: printf("second"); break;
                   1967:      case \(fmc\(fm: printf("third"); break;
                   1968:        default: ECHO; break;
                   1969:        }
                   1970:        }
                   1971: .TE
                   1972: should be adequate.
                   1973: .PP
                   1974: To handle the same problem with start conditions, each
                   1975: start condition must be introduced to Lex in the definitions section
                   1976: with a line reading
                   1977: .TS
                   1978: center;
                   1979: l l.
                   1980: %Start name1 name2 ...
                   1981: .TE
                   1982: where the conditions may be named in any order.
                   1983: The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR.
                   1984: The conditions may be referenced at the
                   1985: head of a rule with the <> brackets:
                   1986: .TS
                   1987: center;
                   1988: l.
                   1989: <name1>expression
                   1990: .TE
                   1991: is a rule which is only recognized when Lex is in the
                   1992: start condition \fIname1\fR.
                   1993: To enter a start condition,
                   1994: execute the action statement
                   1995: .TS
                   1996: center;
                   1997: l.
                   1998: BEGIN name1;
                   1999: .TE
                   2000: which changes the start condition to \fIname1\fR.
                   2001: To resume the normal state,
                   2002: .TS
                   2003: center;
                   2004: l.
                   2005: BEGIN 0;
                   2006: .TE
                   2007: resets the initial condition
                   2008: of the Lex automaton interpreter.
                   2009: A rule may be active in several
                   2010: start conditions:
                   2011: .TS
                   2012: center;
                   2013: l.
                   2014: <name1,name2,name3>
                   2015: .TE
                   2016: is a legal prefix.  Any rule not beginning with the
                   2017: <> prefix operator is always active.
                   2018: .PP
                   2019: The same example as before can be written:
                   2020: .TS
                   2021: center;
                   2022: l l.
                   2023: %START AA BB CC
                   2024: %%
                   2025: ^a     {ECHO; BEGIN AA;}
                   2026: ^b     {ECHO; BEGIN BB;}
                   2027: ^c     {ECHO; BEGIN CC;}
                   2028: \en    {ECHO; BEGIN 0;}
                   2029: <AA>magic      printf("first");
                   2030: <BB>magic      printf("second");
                   2031: <CC>magic      printf("third");
                   2032: .TE
                   2033: where the logic is exactly the same as in the previous
                   2034: method of handling the problem, but Lex does the work
                   2035: rather than the user's code.
                   2036: .2C
                   2037: .NH
                   2038: Character Set.
                   2039: .PP
                   2040: The programs generated by Lex handle
                   2041: character I/O only through the routines
                   2042: .I
                   2043: input,
                   2044: output,
                   2045: .R
                   2046: and
                   2047: .I
                   2048: unput.
                   2049: .R
                   2050: Thus the character representation
                   2051: provided in these routines
                   2052: is accepted by Lex and employed to return
                   2053: values in
                   2054: .I
                   2055: yytext.
                   2056: .R
                   2057: For internal use
                   2058: a character is represented as a small integer
                   2059: which, if the standard library is used,
                   2060: has a value equal to the integer value of the bit
                   2061: pattern representing the character on the host computer.
                   2062: Normally, the letter
                   2063: .I a
                   2064: is represented as the same form as the character constant
                   2065: .I \(fma\(fm .
                   2066: If this interpretation is changed, by providing I/O
                   2067: routines which translate the characters,
                   2068: Lex must be told about
                   2069: it, by giving a translation table.
                   2070: This table must be in the definitions section,
                   2071: and must be bracketed by lines containing  only
                   2072: ``%T''.
                   2073: The table contains lines of the form
                   2074: .TS
                   2075: center;
                   2076: l.
                   2077: {integer} {character string}
                   2078: .TE
                   2079: which indicate the value associated with each character.
                   2080: Thus the next example
                   2081: .GS 2
                   2082: .TS
                   2083: center;
                   2084: l l.
                   2085: %T
                   2086:  1     Aa
                   2087:  2     Bb
                   2088: \&...
                   2089: 26     Zz
                   2090: 27     \en
                   2091: 28     +
                   2092: 29     \-
                   2093: 30     0
                   2094: 31     1
                   2095: \&...
                   2096: 39     9
                   2097: %T
                   2098: .TE
                   2099: .sp
                   2100: .ce 1
                   2101: Sample character table.
                   2102: .GE
                   2103: maps the lower and upper case letters together into the integers 1 through 26,
                   2104: newline into 27, + and \- into 28 and 29, and the
                   2105: digits into 30 through 39.
                   2106: Note the escape for newline.
                   2107: If a table is supplied, every character that is to appear either
                   2108: in the rules or in any valid input must be included
                   2109: in the table.
                   2110: No character
                   2111: may be assigned the number 0, and no character may be
                   2112: assigned a bigger number than the size of the hardware character set.
                   2113: .2C
                   2114: .NH
                   2115: Summary of Source Format.
                   2116: .PP
                   2117: The general form of a Lex source file is:
                   2118: .TS
                   2119: center;
                   2120: l.
                   2121: {definitions}
                   2122: %%
                   2123: {rules}
                   2124: %%
                   2125: {user subroutines}
                   2126: .TE
                   2127: The definitions section contains
                   2128: a combination of
                   2129: .IP 1)
                   2130: Definitions, in the form ``name space translation''.
                   2131: .IP 2)
                   2132: Included code, in the form ``space code''.
                   2133: .IP 3)
                   2134: Included code, in the form
                   2135: .TS
                   2136: center;
                   2137: l.
                   2138: %{
                   2139: code
                   2140: %}
                   2141: .TE
                   2142: .ns
                   2143: .IP 4)
                   2144: Start conditions, given in the form
                   2145: .TS
                   2146: center;
                   2147: l.
                   2148: %S name1 name2 ...
                   2149: .TE
                   2150: .ns
                   2151: .IP 5)
                   2152: Character set tables, in the form
                   2153: .TS
                   2154: center;
                   2155: l.
                   2156: %T
                   2157: number space character-string
                   2158: \&...
                   2159: %T
                   2160: .TE
                   2161: .ns
                   2162: .IP 6)
                   2163: Changes to internal array sizes, in the form
                   2164: .TS
                   2165: center;
                   2166: l.
                   2167: %\fIx\fR\0\0\fInnn\fR
                   2168: .TE
                   2169: where \fInnn\fR is a decimal integer representing an array size
                   2170: and \fIx\fR selects the parameter as follows:
                   2171: .TS
                   2172: center;
                   2173: c c
                   2174: c l.
                   2175: Letter Parameter
                   2176: p      positions
                   2177: n      states
                   2178: e      tree nodes
                   2179: a      transitions
                   2180: k      packed character classes
                   2181: o      output array size
                   2182: .TE
                   2183: .LP
                   2184: Lines in the rules section have the form ``expression  action''
                   2185: where the action may be continued on succeeding
                   2186: lines by using braces to delimit it.
                   2187: .PP
                   2188: Regular expressions in Lex use the following
                   2189: operators:
                   2190: .br
                   2191: .TS
                   2192: center;
                   2193: l l.
                   2194: x      the character "x"
                   2195: "x"    an "x", even if x is an operator.
                   2196: \ex    an "x", even if x is an operator.
                   2197: [xy]   the character x or y.
                   2198: [x\-z] the characters x, y or z.
                   2199: [^x]   any character but x.
                   2200: \&.    any character but newline.
                   2201: ^x     an x at the beginning of a line.
                   2202: <y>x   an x when Lex is in start condition y.
                   2203: x$     an x at the end of a line.
                   2204: x?     an optional x.
                   2205: x\(**  0,1,2, ... instances of x.
                   2206: x+     1,2,3, ... instances of x.
                   2207: x|y    an x or a y.
                   2208: (x)    an x.
                   2209: x/y    an x but only if followed by y.
                   2210: {xx}   the translation of xx from the
                   2211:        definitions section.
                   2212: x{m,n} \fIm\fR through \fIn\fR occurrences of x
                   2213: .TE
                   2214: .NH
                   2215: Caveats and Bugs.
                   2216: .PP
                   2217: There are pathological expressions which
                   2218: produce exponential growth of the tables when
                   2219: converted to deterministic machines;
                   2220: fortunately, they are rare.
                   2221: .PP
                   2222: REJECT does not rescan the input; instead it remembers the results of the previous
                   2223: scan.  This means that if a rule with trailing context is found, and
                   2224: REJECT executed, the user
                   2225: must not have used
                   2226: .ul
                   2227: unput
                   2228: to change the characters forthcoming
                   2229: from the input stream.
                   2230: This is the only restriction on the user's ability to manipulate
                   2231: the not-yet-processed input.
                   2232: .PP
                   2233: .2C
                   2234: .NH
                   2235: Acknowledgments.
                   2236: .PP
                   2237: As should
                   2238: be obvious from the above, the outside of Lex
                   2239: is patterned
                   2240: on Yacc and the inside on Aho's string matching routines.
                   2241: Therefore, both S. C. Johnson and A. V. Aho
                   2242: are really originators
                   2243: of much of Lex,
                   2244: as well as debuggers of it.
                   2245: Many thanks are due to both.
                   2246: .PP
                   2247: The code of the current version of Lex was designed, written,
                   2248: and debugged by Eric Schmidt.
                   2249: .SG MH-1274-MEL-unix
                   2250: .sp 1
                   2251: .2C
                   2252: .NH
                   2253: References.
                   2254: .SP 1v
                   2255: .IP 1.
                   2256: B. W. Kernighan and D. M. Ritchie,
                   2257: .I
                   2258: The C Programming Language,
                   2259: .R
                   2260: Prentice-Hall, N. J. (1978).
                   2261: .IP 2.
                   2262: B. W. Kernighan,
                   2263: .I
                   2264: Ratfor: A Preprocessor for a Rational Fortran,
                   2265: .R
                   2266: Software \- Practice and Experience,
                   2267: \fB5\fR, pp. 395-496 (1975).
                   2268: .IP 3.
                   2269: S. C. Johnson,
                   2270: .I
                   2271: Yacc: Yet Another Compiler Compiler,
                   2272: .R
                   2273: Computing Science Technical Report No. 32,
                   2274: 1975,
                   2275: .MH
                   2276: .if \n(tm (also TM 75-1273-6)
                   2277: .IP 4.
                   2278: A. V. Aho and M. J. Corasick,
                   2279: .I
                   2280: Efficient String Matching: An Aid to Bibliographic Search,
                   2281: .R
                   2282: Comm. ACM
                   2283: .B
                   2284: 18,
                   2285: .R
                   2286: 333-340 (1975).
                   2287: .IP 5.
                   2288: B. W. Kernighan, D. M. Ritchie and K. L. Thompson,
                   2289: .I
                   2290: QED Text Editor,
                   2291: .R
                   2292: Computing Science Technical Report No. 5,
                   2293: 1972,
                   2294: .MH
                   2295: .IP 6.
                   2296: D. M. Ritchie,
                   2297: private communication.
                   2298: See also
                   2299: M. E. Lesk,
                   2300: .I
                   2301: The Portable C Library,
                   2302: .R
                   2303: Computing Science Technical Report No. 31,
                   2304: .MH
                   2305: .if \n(tm (also TM 75-1274-11)

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.