|
|
1.1 root 1: .\" @(#)lex.ms 6.2 (Berkeley) 5/1/87
2: .\"
3: .EH 'PS1:16-%''Lex \- A Lexical Analyzer Generator'
4: .OH 'Lex \- A Lexical Analyzer Generator''PS1:16-%'
5: .hc ~
6: .bd I 2
7: .de TS
8: .br
9: .nf
10: .SP 1v
11: .ul 0
12: ..
13: .de TE
14: .SP 1v
15: .fi
16: ..
17: .\".de PT
18: .\".if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''
19: .\".if \\n%>1 'sp
20: .\"..
21: .ND July 21, 1975
22: .\".RP
23: .\".TM 75-1274-15 39199 39199-11
24: .TL
25: Lex \- A Lexical Analyzer ~Generator~
26: .AU ``MH 2C-569'' 6377
27: M. E. Lesk and E. Schmidt
28: .AI
29: .MH
30: .AB
31: .sp
32: .bd I 2
33: .\".nr PS 8
34: .\".nr VS 9
35: .\".ps 8
36: .\".vs 9p
37: Lex helps write programs whose control flow
38: is directed by instances of regular
39: expressions in the input stream.
40: It is well suited for editor-script type transformations and
41: for segmenting input in preparation for
42: a parsing routine.
43: .PP
44: Lex source is a table of regular expressions and corresponding program fragments.
45: The table is translated to a program
46: which reads an input stream, copying it to an output stream
47: and partitioning the input
48: into strings which match the given expressions.
49: As each such string is recognized the corresponding
50: program fragment is executed.
51: The recognition of the expressions
52: is performed by a deterministic finite automaton
53: generated by Lex.
54: The program fragments written by the user are executed in the order in which the
55: corresponding regular expressions occur in the input stream.
56: .if n .if \n(tm .ig
57: .PP
58: The lexical analysis
59: programs written with Lex accept ambiguous specifications
60: and choose the longest
61: match possible at each input point.
62: If necessary, substantial look~ahead
63: is performed on the input, but the
64: input stream will be backed up to the
65: end of the current partition, so that the user
66: has general freedom to manipulate it.
67: .PP
68: Lex can generate analyzers in either C or Ratfor, a language
69: which can be translated automatically to portable Fortran.
70: It is available on the PDP-11 UNIX, Honeywell GCOS,
71: and IBM OS systems.
72: This manual, however, will only discuss generating analyzers
73: in C on the UNIX system, which is the only supported
74: form of Lex under UNIX Version 7.
75: Lex is designed to simplify
76: interfacing with Yacc, for those
77: with access to this compiler-compiler system.
78: ..
79: .\".nr PS 9
80: .\".nr VS 11
81: .AE
82: .2C
83: .NH
84: Introduction.
85: .PP
86: Lex is a program generator designed for
87: lexical processing of character input streams.
88: It accepts a high-level, problem oriented specification
89: for character string matching,
90: and
91: produces a program in a general purpose language which recognizes
92: regular expressions.
93: The regular expressions are specified by the user in the
94: source specifications given to Lex.
95: The Lex written code recognizes these expressions
96: in an input stream and partitions the input stream into
97: strings matching the expressions. At the bound~aries
98: between strings
99: program sections
100: provided by the user are executed.
101: The Lex source file associates the regular expressions and the
102: program fragments.
103: As each expression appears in the input to the program written by Lex,
104: the corresponding fragment is executed.
105: .PP
106: .de MH
107: Bell Laboratories, Murray Hill, NJ 07974.
108: ..
109: The user supplies the additional code
110: beyond expression matching
111: needed to complete his tasks, possibly
112: including code written by other generators.
113: The program that recognizes the expressions is generated in the
114: general purpose programming language employed for the
115: user's program fragments.
116: Thus, a high level expression
117: language is provided to write the string expressions to be
118: matched while the user's freedom to write actions
119: is unimpaired.
120: This avoids forcing the user who wishes to use a string manipulation
121: language for input analysis to write processing programs in the same
122: and often inappropriate string handling language.
123: .PP
124: Lex is not a complete language, but rather a generator representing
125: a new language feature which can be added to
126: different programming languages, called ``host languages.''
127: Just as general purpose languages
128: can produce code to run on different computer hardware,
129: Lex can write code in different host languages.
130: The host language is used for the output code generated by Lex
131: and also for the program fragments added by the user.
132: Compatible run-time libraries for the different host languages
133: are also provided.
134: This makes Lex adaptable to different environments and
135: different users.
136: Each application
137: may be directed to the combination of hardware and host language appropriate
138: to the task, the user's background, and the properties of local
139: implementations.
140: At present, the only supported host language is C,
141: although Fortran (in the form of Ratfor [2] has been available
142: in the past.
143: Lex itself exists on UNIX, GCOS, and OS/370; but the
144: code generated by Lex may be taken anywhere the appropriate
145: compilers exist.
146: .PP
147: Lex turns the user's expressions and actions
148: (called
149: .ul
150: source
151: in this memo) into the host general-purpose language;
152: the generated program is named
153: .ul
154: yylex.
155: The
156: .ul
157: yylex
158: program
159: will recognize expressions
160: in a stream
161: (called
162: .ul
163: input
164: in this memo)
165: and perform the specified actions for each expression as it is detected.
166: See Figure 1.
167: .GS
168: .TS
169: center;
170: l _ r
171: l|c|r
172: l _ r
173: l _ r
174: l|c|r
175: l _ r
176: c s s
177: c s s.
178:
179: Source \(-> Lex \(-> yylex
180:
181: .sp 2
182:
183: Input \(-> yylex \(-> Output
184:
185: .sp
186: An overview of Lex
187: Figure 1
188: .TE
189: .GE
190: .PP
191: For a trivial example, consider a program to delete
192: from the input
193: all blanks or tabs at the ends of lines.
194: .TS
195: center;
196: l l.
197: %%
198: [ \et]+$ ;
199: .TE
200: is all that is required.
201: The program
202: contains a %% delimiter to mark the beginning of the rules, and
203: one rule.
204: This rule contains a regular expression
205: which matches one or more
206: instances of the characters blank or tab
207: (written \et for visibility, in accordance with the C language convention)
208: just prior to the end of a line.
209: The brackets indicate the character
210: class made of blank and tab; the + indicates ``one or more ...'';
211: and the $ indicates ``end of line,'' as in QED.
212: No action is specified,
213: so the program generated by Lex (yylex) will ignore these characters.
214: Everything else will be copied.
215: To change any remaining
216: string of blanks or tabs to a single blank,
217: add another rule:
218: .TS
219: center;
220: l l.
221: %%
222: [ \et]+$ ;
223: [ \et]+ printf(" ");
224: .TE
225: The finite automaton generated for this
226: source will scan for both rules at once,
227: observing at
228: the termination of the string of blanks or tabs
229: whether or not there is a newline character, and executing
230: the desired rule action.
231: The first rule matches all strings of blanks or tabs
232: at the end of lines, and the second
233: rule all remaining strings of blanks or tabs.
234: .PP
235: Lex can be used alone for simple transformations, or
236: for analysis and statistics gathering on a lexical level.
237: Lex can also be used with a parser generator
238: to perform the lexical analysis phase; it is particularly
239: easy to interface Lex and Yacc [3].
240: Lex programs recognize only regular expressions;
241: Yacc writes parsers that accept a large class of context free grammars,
242: but require a lower level analyzer to recognize input tokens.
243: Thus, a combination of Lex and Yacc is often appropriate.
244: When used as a preprocessor for a later parser generator,
245: Lex is used to partition the input stream,
246: and the parser generator assigns structure to
247: the resulting pieces.
248: The flow of control
249: in such a case (which might be the first half of a compiler,
250: for example) is shown in Figure 2.
251: Additional programs,
252: written by other generators
253: or by hand, can
254: be added easily to programs written by Lex.
255: .BS 2
256: .ps 9
257: .vs 11
258: .TS
259: center;
260: l c c c l
261: l c c c l
262: l c c c l
263: l _ c _ l
264: l|c|c|c|l
265: l _ c _ l
266: l c c c l
267: l _ c _ l
268: l|c|c|c|l
269: l _ c _ l
270: l c s s l
271: l c s s l.
272: lexical grammar
273: rules rules
274: \(da \(da
275:
276: Lex Yacc
277:
278: \(da \(da
279:
280: Input \(-> yylex \(-> yyparse \(-> Parsed input
281:
282: .sp
283: Lex with Yacc
284: Figure 2
285: .TE
286: .ps 10
287: .vs 12
288: .BE
289: Yacc users
290: will realize that the name
291: .ul
292: yylex
293: is what Yacc expects its lexical analyzer to be named,
294: so that the use of this name by Lex simplifies
295: interfacing.
296: .PP
297: Lex generates a deterministic finite automaton from the regular expressions
298: in the source [4].
299: The automaton is interpreted, rather than compiled, in order
300: to save space.
301: The result is still a fast analyzer.
302: In particular, the time taken by a Lex program
303: to recognize and partition an input stream is
304: proportional to the length of the input.
305: The number of Lex rules or
306: the complexity of the rules is
307: not important in determining speed,
308: unless rules which include
309: forward context require a significant amount of re~scanning.
310: What does increase with the number and complexity of rules
311: is the size of the finite
312: automaton, and therefore the size of the program
313: generated by Lex.
314: .PP
315: In the program written by Lex, the user's fragments
316: (representing the
317: .ul
318: actions
319: to be performed as each regular expression
320: is found)
321: are gathered
322: as cases of a switch.
323: The automaton interpreter directs the control flow.
324: Opportunity is provided for the user to insert either
325: declarations or additional statements in the routine containing
326: the actions, or to
327: add subroutines outside this action routine.
328: .PP
329: Lex is not limited to source which can
330: be interpreted on the basis of one character
331: look~ahead.
332: For example,
333: if there are two rules, one looking for
334: .I ab
335: and another for
336: .I abcdefg ,
337: and the input stream is
338: .I abcdefh ,
339: Lex will recognize
340: .I ab
341: and leave
342: the input pointer just before
343: .I "cd. . ."
344: Such backup is more costly
345: than the processing of simpler languages.
346: .2C
347: .NH
348: Lex Source.
349: .PP
350: The general format of Lex source is:
351: .TS
352: center;
353: l.
354: {definitions}
355: %%
356: {rules}
357: %%
358: {user subroutines}
359: .TE
360: where the definitions and the user subroutines
361: are often omitted.
362: The second
363: .I %%
364: is optional, but the first is required
365: to mark the beginning of the rules.
366: The absolute minimum Lex program is thus
367: .TS
368: center;
369: l.
370: %%
371: .TE
372: (no definitions, no rules) which translates into a program
373: which copies the input to the output unchanged.
374: .PP
375: In the outline of Lex programs shown above, the
376: .I
377: rules
378: .R
379: represent the user's control
380: decisions; they are a table, in which the left column
381: contains
382: .I
383: regular expressions
384: .R
385: (see section 3)
386: and the right column contains
387: .I
388: actions,
389: .R
390: program fragments to be executed when the expressions
391: are recognized.
392: Thus an individual rule might appear
393: .TS
394: center;
395: l l.
396: integer printf("found keyword INT");
397: .TE
398: to look for the string
399: .I integer
400: in the input stream and
401: print the message ``found keyword INT'' whenever it appears.
402: In this example the host procedural language is C and
403: the C library function
404: .I
405: printf
406: .R
407: is used to print the string.
408: The end
409: of the expression is indicated by the first blank or tab character.
410: If the action is merely a single C expression,
411: it can just be given on the right side of the line; if it is
412: compound, or takes more than a line, it should be enclosed in
413: braces.
414: As a slightly more useful example, suppose it is desired to
415: change a number of words from British to American spelling.
416: Lex rules such as
417: .TS
418: center;
419: l l.
420: colour printf("color");
421: mechanise printf("mechanize");
422: petrol printf("gas");
423: .TE
424: would be a start. These rules are not quite enough,
425: since
426: the word
427: .I petroleum
428: would become
429: .I gaseum ;
430: a way of dealing
431: with this will be described later.
432: .2C
433: .NH
434: Lex Regular Expressions.
435: .PP
436: The definitions of regular expressions are very similar to those
437: in QED [5].
438: A regular
439: expression specifies a set of strings to be matched.
440: It contains text characters (which match the corresponding
441: characters in the strings being compared)
442: and operator characters (which specify
443: repetitions, choices, and other features).
444: The letters of the alphabet and the digits are
445: always text characters; thus the regular expression
446: .TS
447: center;
448: l l.
449: integer
450: .TE
451: matches the string
452: .ul
453: integer
454: wherever it appears
455: and the expression
456: .TS
457: center;
458: l.
459: a57D
460: .TE
461: looks for the string
462: .ul
463: a57D.
464: .PP
465: .I
466: Operators.
467: .R
468: The operator characters are
469: .TS
470: center;
471: l.
472: " \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < >
473: .TE
474: and if they are to be used as text characters, an escape
475: should be used.
476: The quotation mark operator (")
477: indicates that whatever is contained between a pair of quotes
478: is to be taken as text characters.
479: Thus
480: .TS
481: center;
482: l.
483: xyz"++"
484: .TE
485: matches the string
486: .I xyz++
487: when it appears. Note that a part of a string may be quoted.
488: It is harmless but unnecessary to quote an ordinary
489: text character; the expression
490: .TS
491: center;
492: l.
493: "xyz++"
494: .TE
495: is the same as the one above.
496: Thus by quoting every non-alphanumeric character
497: being used as a text character, the user can avoid remembering
498: the list above of current
499: operator characters, and is safe should further extensions to Lex
500: lengthen the list.
501: .PP
502: An operator character may also be turned into a text character
503: by preceding it with \e as in
504: .TS
505: center;
506: l.
507: xyz\e+\e+
508: .TE
509: which
510: is another, less readable, equivalent of the above expressions.
511: Another use of the quoting mechanism is to get a blank into
512: an expression; normally, as explained above, blanks or tabs end
513: a rule.
514: Any blank character not contained within [\|] (see below) must
515: be quoted.
516: Several normal C escapes with \e
517: are recognized: \en is newline, \et is tab, and \eb is backspace.
518: To enter \e itself, use \e\e.
519: Since newline is illegal in an expression, \en must be used;
520: it is not
521: required to escape tab and backspace.
522: Every character but blank, tab, newline and the list above is always
523: a text character.
524: .PP
525: .I
526: Character classes.
527: .R
528: Classes of characters can be specified using the operator pair [\|].
529: The construction
530: .I [abc]
531: matches a
532: single character, which may be
533: .I a ,
534: .I b ,
535: or
536: .I c .
537: Within square brackets,
538: most operator meanings are ignored.
539: Only three characters are special:
540: these are \e \(mi and ^. The \(mi character
541: indicates ranges. For example,
542: .TS
543: center;
544: l.
545: [a\(miz0\(mi9<>_]
546: .TE
547: indicates the character class containing all the lower case letters,
548: the digits,
549: the angle brackets, and underline.
550: Ranges may be given in either order.
551: Using \(mi between any pair of characters which are
552: not both upper case letters, both lower case letters, or both digits
553: is implementation dependent and will get a warning message.
554: (E.g., [0\-z] in ASCII is many more characters
555: than it is in EBCDIC).
556: If it is desired to include the
557: character \(mi in a character class, it should be first or
558: last; thus
559: .TS
560: center;
561: l.
562: [\(mi+0\(mi9]
563: .TE
564: matches all the digits and the two signs.
565: .PP
566: In character classes,
567: the ^ operator must appear as the first character
568: after the left bracket; it indicates that the resulting string
569: is to be complemented with respect to the computer character set.
570: Thus
571: .TS
572: center;
573: l.
574: [^abc]
575: .TE
576: matches all characters except a, b, or c, including
577: all special or control characters; or
578: .TS
579: center;
580: l.
581: [^a\-zA\-Z]
582: .TE
583: is any character which is not a letter.
584: The \e character provides the usual escapes within
585: character class brackets.
586: .PP
587: .I
588: Arbitrary character.
589: .R
590: To match almost any character, the operator character
591: .TS
592: center;
593: l.
594: \&.
595: .TE
596: is the class of all characters except newline.
597: Escaping into octal is possible although non-portable:
598: .TS
599: center;
600: l.
601: [\e40\-\e176]
602: .TE
603: matches all printable characters in the ASCII character set, from octal
604: 40 (blank) to octal 176 (tilde).
605: .PP
606: .I
607: Optional expressions.
608: .R
609: The operator
610: .I ?
611: indicates
612: an optional element of an expression.
613: Thus
614: .TS
615: center;
616: l.
617: ab?c
618: .TE
619: matches either
620: .I ac
621: or
622: .I abc .
623: .PP
624: .I
625: Repeated expressions.
626: .R
627: Repetitions of classes are indicated by the operators
628: .I \(**
629: and
630: .I + .
631: .TS
632: center;
633: l.
634: \f2a\(**\f1
635: .TE
636: is any number of consecutive
637: .I a
638: characters, including zero; while
639: .TS
640: center;
641: l.
642: a+
643: .TE
644: is one or more instances of
645: .I a.
646: For example,
647: .TS
648: center;
649: l.
650: [a\-z]+
651: .TE
652: is all strings of lower case letters.
653: And
654: .TS
655: center;
656: l.
657: [A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(**
658: .TE
659: indicates all alphanumeric strings with a leading
660: alphabetic character.
661: This is a typical expression for recognizing identifiers in
662: computer languages.
663: .PP
664: .I
665: Alternation and Grouping.
666: .R
667: The operator |
668: indicates alternation:
669: .TS
670: center;
671: l.
672: (ab\||\|cd)
673: .TE
674: matches either
675: .ul
676: ab
677: or
678: .ul
679: cd.
680: Note that parentheses are used for grouping, although
681: they are
682: not necessary on the outside level;
683: .TS
684: center;
685: l.
686: ab\||\|cd
687: .TE
688: would have sufficed.
689: Parentheses
690: can be used for more complex expressions:
691: .TS
692: center;
693: l.
694: (ab\||\|cd+)?(ef)\(**
695: .TE
696: matches such strings as
697: .I abefef ,
698: .I efefef ,
699: .I cdef ,
700: or
701: .I cddd\| ;
702: but not
703: .I abc ,
704: .I abcd ,
705: or
706: .I abcdef .
707: .PP
708: .I
709: Context sensitivity.
710: .R
711: Lex will recognize a small amount of surrounding
712: context. The two simplest operators for this are
713: .I ^
714: and
715: .I $ .
716: If the first character of an expression is
717: .I ^ ,
718: the expression will only be matched at the beginning
719: of a line (after a newline character, or at the beginning of
720: the input stream).
721: This can never conflict with the other meaning of
722: .I ^ ,
723: complementation
724: of character classes, since that only applies within
725: the [\|] operators.
726: If the very last character is
727: .I $ ,
728: the expression will only be matched at the end of a line (when
729: immediately followed by newline).
730: The latter operator is a special case of the
731: .I /
732: operator character,
733: which indicates trailing context.
734: The expression
735: .TS
736: center;
737: l.
738: ab/cd
739: .TE
740: matches the string
741: .I ab ,
742: but only if followed by
743: .ul
744: cd.
745: Thus
746: .TS
747: center;
748: l.
749: ab$
750: .TE
751: is the same as
752: .TS
753: center;
754: l.
755: ab/\en
756: .TE
757: Left context is handled in Lex by
758: .I
759: start conditions
760: .R
761: as explained in section 10. If a rule is only to be executed
762: when the Lex automaton interpreter is in start condition
763: .I
764: x,
765: .R
766: the rule should be prefixed by
767: .TS
768: center;
769: l.
770: <x>
771: .TE
772: using the angle bracket operator characters.
773: If we considered ``being at the beginning of a line'' to be
774: start condition
775: .I ONE ,
776: then the ^ operator
777: would be equivalent to
778: .TS
779: center;
780: l.
781: <ONE>
782: .TE
783: Start conditions are explained more fully later.
784: .PP
785: .I
786: Repetitions and Definitions.
787: .R
788: The operators {} specify
789: either repetitions (if they enclose numbers)
790: or
791: definition expansion (if they enclose a name). For example
792: .TS
793: center;
794: l.
795: {digit}
796: .TE
797: looks for a predefined string named
798: .I digit
799: and inserts it
800: at that point in the expression.
801: The definitions are given in the first part of the Lex
802: input, before the rules.
803: In contrast,
804: .TS
805: center;
806: l.
807: a{1,5}
808: .TE
809: looks for 1 to 5 occurrences of
810: .I a .
811: .PP
812: Finally, initial
813: .I %
814: is special, being the separator
815: for Lex source segments.
816: .2C
817: .NH
818: Lex Actions.
819: .PP
820: When an expression written as above is matched, Lex
821: executes the corresponding action. This section describes
822: some features of Lex which aid in writing actions. Note
823: that there is a default action, which
824: consists of copying the input to the output. This
825: is performed on all strings not otherwise matched. Thus
826: the Lex user who wishes to absorb the entire input, without
827: producing any output, must provide rules to match everything.
828: When Lex is being used with Yacc, this is the normal
829: situation.
830: One may consider that actions are what is done instead of
831: copying the input to the output; thus, in general,
832: a rule which merely copies can be omitted.
833: Also, a character combination
834: which is omitted from the rules
835: and which appears as input
836: is likely to be printed on the output, thus calling
837: attention to the gap in the rules.
838: .PP
839: One of the simplest things that can be done is to ignore
840: the input. Specifying a C null statement, \fI;\fR as an action
841: causes this result. A frequent rule is
842: .TS
843: center;
844: l l.
845: [ \et\en] ;
846: .TE
847: which causes the three spacing characters (blank, tab, and newline)
848: to be ignored.
849: .PP
850: Another easy way to avoid writing actions is the action character
851: |, which indicates that the action for this rule is the action
852: for the next rule.
853: The previous example could also have been written
854: .TS
855: center;
856: l l.
857: " " |
858: "\et" |
859: "\en" ;
860: .TE
861: with the same result, although in different style.
862: The quotes around \en and \et are not required.
863: .PP
864: In more complex actions, the user
865: will
866: often want to know the actual text that matched some expression
867: like
868: .I [a\(miz]+ .
869: Lex leaves this text in an external character
870: array named
871: .I
872: yytext.
873: .R
874: Thus, to print the name found,
875: a rule like
876: .TS
877: center;
878: l l.
879: [a\-z]+ printf("%s", yytext);
880: .TE
881: will print
882: the string in
883: .I
884: yytext.
885: .R
886: The C function
887: .I
888: printf
889: .R
890: accepts a format argument and data to be printed;
891: in this case, the format is ``print string'' (% indicating
892: data conversion, and
893: .I s
894: indicating string type),
895: and the data are the characters
896: in
897: .I
898: yytext.
899: .R
900: So this just places
901: the matched string
902: on the output.
903: This action
904: is so common that
905: it may be written as ECHO:
906: .TS
907: center;
908: l l.
909: [a\-z]+ ECHO;
910: .TE
911: is the same as the above.
912: Since the default action is just to
913: print the characters found, one might ask why
914: give a rule, like this one, which merely specifies
915: the default action?
916: Such rules are often required
917: to avoid matching some other rule
918: which is not desired. For example, if there is a rule
919: which matches
920: .I read
921: it will normally match the instances of
922: .I read
923: contained in
924: .I bread
925: or
926: .I readjust ;
927: to avoid
928: this,
929: a rule
930: of the form
931: .I [a\(miz]+
932: is needed.
933: This is explained further below.
934: .PP
935: Sometimes it is more convenient to know the end of what
936: has been found; hence Lex also provides a count
937: .I
938: yyleng
939: .R
940: of the number of characters matched.
941: To count both the number
942: of words and the number of characters in words in the input, the user might write
943: .TS
944: center;
945: l l.
946: [a\-zA\-Z]+ {words++; chars += yyleng;}
947: .TE
948: which accumulates in
949: .ul
950: chars
951: the number
952: of characters in the words recognized.
953: The last character in the string matched can
954: be accessed by
955: .TS
956: center;
957: l.
958: yytext[yyleng\-1]
959: .TE
960: .PP
961: Occasionally, a Lex
962: action may decide that a rule has not recognized the correct
963: span of characters.
964: Two routines are provided to aid with this situation.
965: First,
966: .I
967: yymore()
968: .R
969: can be called to indicate that the next input expression recognized is to be
970: tacked on to the end of this input. Normally,
971: the next input string would overwrite the current
972: entry in
973: .I
974: yytext.
975: .R
976: Second,
977: .I
978: yyless (n)
979: .R
980: may be called to indicate that not all the characters matched
981: by the currently successful expression are wanted right now.
982: The argument
983: .I
984: n
985: .R
986: indicates the number of characters
987: in
988: .I
989: yytext
990: .R
991: to be retained.
992: Further characters previously matched
993: are
994: returned to the input. This provides the same sort of
995: look~ahead offered by the / operator,
996: but in a different form.
997: .PP
998: .I
999: Example:
1000: .R
1001: Consider a language which defines
1002: a string as a set of characters between quotation (") marks, and provides that
1003: to include a " in a string it must be preceded by a \e. The
1004: regular expression which matches that is somewhat confusing,
1005: so that it might be preferable to write
1006: .TS
1007: center;
1008: l l.
1009: \e"[^"]\(** {
1010: if (yytext[yyleng\-1] == \(fm\e\e\(fm)
1011: yymore();
1012: else
1013: ... normal user processing
1014: }
1015: .TE
1016: which will, when faced with a string such as
1017: .I
1018: "abc\e"def\|"
1019: .R
1020: first match
1021: the five characters
1022: \fI"abc\e\|\fR;
1023: then
1024: the call to
1025: .I yymore()
1026: will
1027: cause the next part of the string,
1028: \fI"def\|\fR,
1029: to be tacked on the end.
1030: Note that the final quote terminating the string should be picked
1031: up in the code labeled ``normal processing''.
1032: .PP
1033: The function
1034: .I
1035: yyless()
1036: .R
1037: might be used to reprocess
1038: text in various circumstances. Consider the C problem of distinguishing
1039: the ambiguity of ``=\(mia''.
1040: Suppose it is desired to treat this as ``=\(mi a''
1041: but print a message. A rule might be
1042: .ps 9
1043: .vs 11
1044: .TS
1045: center;
1046: l l.
1047: =\(mi[a\-zA\-Z] {
1048: printf("Op (=\(mi) ambiguous\en");
1049: yyless(yyleng\-1);
1050: ... action for =\(mi ...
1051: }
1052: .TE
1053: .ps 10
1054: .vs 12
1055: which prints a message, returns the letter after the
1056: operator to the input stream, and treats the operator as ``=\(mi''.
1057: Alternatively it might be desired to treat this as ``= \(mia''.
1058: To do this, just return the minus
1059: sign as well as the letter to the input:
1060: .ps 9
1061: .vs 11
1062: .TS
1063: center;
1064: l l.
1065: =\(mi[a\-zA\-Z] {
1066: printf("Op (=\(mi) ambiguous\en");
1067: yyless(yyleng\-2);
1068: ... action for = ...
1069: }
1070: .TE
1071: .ps 10
1072: .vs 12
1073: will perform the other interpretation.
1074: Note that the expressions for the two cases might more easily
1075: be written
1076: .TS
1077: center;
1078: l l.
1079: =\(mi/[A\-Za\-z]
1080: .TE
1081: in the first case and
1082: .TS
1083: center;
1084: l.
1085: =/\-[A\-Za\-z]
1086: .TE
1087: in the second;
1088: no backup would be required in the rule action.
1089: It is not necessary to recognize the whole identifier
1090: to observe the ambiguity.
1091: The
1092: possibility of ``=\(mi3'', however, makes
1093: .TS
1094: center;
1095: l.
1096: =\(mi/[^ \et\en]
1097: .TE
1098: a still better rule.
1099: .PP
1100: In addition to these routines, Lex also permits
1101: access to the I/O routines
1102: it uses.
1103: They are:
1104: .IP 1)
1105: .I
1106: input()
1107: .R
1108: which returns the next input character;
1109: .IP 2)
1110: .I
1111: output(c)
1112: .R
1113: which writes the character
1114: .I
1115: c
1116: .R
1117: on the output; and
1118: .IP 3)
1119: .I
1120: unput(c)
1121: .R
1122: pushes the character
1123: .I
1124: c
1125: .R
1126: back onto the input stream to be read later by
1127: .I
1128: input().
1129: .R
1130: .LP
1131: By default these routines are provided as macro definitions,
1132: but the user can override them and supply private versions.
1133: These routines
1134: define the relationship between external files and
1135: internal characters, and must all be retained
1136: or modified consistently.
1137: They may be redefined, to
1138: cause input or output to be transmitted to or from strange
1139: places, including other programs or internal memory;
1140: but the character set used must be consistent in all routines;
1141: a value of zero returned by
1142: .I
1143: input
1144: .R
1145: must mean end of file; and
1146: the relationship between
1147: .I
1148: unput
1149: .R
1150: and
1151: .I
1152: input
1153: .R
1154: must be retained
1155: or the Lex look~ahead will not work.
1156: Lex does not look ahead at all if it does not have to,
1157: but every rule ending in
1158: .ft I
1159: + \(** ?
1160: .ft R
1161: or
1162: .ft I
1163: $
1164: .ft R
1165: or containing
1166: .ft I
1167: /
1168: .ft R
1169: implies look~ahead.
1170: Look~ahead is also necessary to match an expression that is a prefix
1171: of another expression.
1172: See below for a discussion of the character set used by Lex.
1173: The standard Lex library imposes
1174: a 100 character limit on backup.
1175: .PP
1176: Another Lex library routine that the user will sometimes want
1177: to redefine is
1178: .I
1179: yywrap()
1180: .R
1181: which is called whenever Lex reaches an end-of-file.
1182: If
1183: .I
1184: yywrap
1185: .R
1186: returns a 1, Lex continues with the normal wrapup on end of input.
1187: Sometimes, however, it is convenient to arrange for more
1188: input to arrive
1189: from a new source.
1190: In this case, the user should provide
1191: a
1192: .I
1193: yywrap
1194: .R
1195: which
1196: arranges for new input and
1197: returns 0. This instructs Lex to continue processing.
1198: The default
1199: .I
1200: yywrap
1201: .R
1202: always returns 1.
1203: .PP
1204: This routine is also a convenient place
1205: to print tables, summaries, etc. at the end
1206: of a program. Note that it is not
1207: possible to write a normal rule which recognizes
1208: end-of-file; the only access to this condition is
1209: through
1210: .I
1211: yywrap.
1212: .R
1213: In fact, unless a private version of
1214: .I
1215: input()
1216: .R
1217: is supplied
1218: a file containing nulls
1219: cannot be handled,
1220: since a value of 0 returned by
1221: .I
1222: input
1223: .R
1224: is taken to be end-of-file.
1225: .PP
1226: .2C
1227: .NH
1228: Ambiguous Source Rules.
1229: .PP
1230: Lex can handle ambiguous specifications.
1231: When more than one expression can match the
1232: current input, Lex chooses as follows:
1233: .IP 1)
1234: The longest match is preferred.
1235: .IP 2)
1236: Among rules which matched the same number of characters,
1237: the rule given first is preferred.
1238: .LP
1239: Thus, suppose the rules
1240: .TS
1241: center;
1242: l l.
1243: integer keyword action ...;
1244: [a\-z]+ identifier action ...;
1245: .TE
1246: to be given in that order. If the input is
1247: .I integers ,
1248: it is taken as an identifier, because
1249: .I [a\-z]+
1250: matches 8 characters while
1251: .I integer
1252: matches only 7.
1253: If the input is
1254: .I integer ,
1255: both rules match 7 characters, and
1256: the keyword rule is selected because it was given first.
1257: Anything shorter (e.g. \fIint\fR\|) will
1258: not match the expression
1259: .I integer
1260: and so the identifier interpretation is used.
1261: .PP
1262: The principle of preferring the longest
1263: match makes rules containing
1264: expressions like
1265: .I \&.\(**
1266: dangerous.
1267: For example,
1268: .TS
1269: center;
1270: l.
1271: \&\(fm.\(**\(fm
1272: .TE
1273: might seem a good way of recognizing
1274: a string in single quotes.
1275: But it is an invitation for the program to read far
1276: ahead, looking for a distant
1277: single quote.
1278: Presented with the input
1279: .TS
1280: center;
1281: l l.
1282: \&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here
1283: .TE
1284: the above expression will match
1285: .TS
1286: center;
1287: l l.
1288: \&\(fmfirst\(fm quoted string here, \(fmsecond\(fm
1289: .TE
1290: which is probably not what was wanted.
1291: A better rule is of the form
1292: .TS
1293: center;
1294: l.
1295: \&\(fm[^\(fm\en]\(**\(fm
1296: .TE
1297: which, on the above input, will stop
1298: after
1299: .I \(fmfirst\(fm .
1300: The consequences
1301: of errors like this are mitigated by the fact
1302: that the
1303: .I \&.
1304: operator will not match newline.
1305: Thus expressions like
1306: .I \&.\(**
1307: stop on the
1308: current line.
1309: Don't try to defeat this with expressions like
1310: .I (.|\en)+
1311: or
1312: equivalents;
1313: the Lex generated program will try to read
1314: the entire input file, causing
1315: internal buffer overflows.
1316: .PP
1317: Note that Lex is normally partitioning
1318: the input stream, not searching for all possible matches
1319: of each expression.
1320: This means that each character is accounted for
1321: once and only once.
1322: For example, suppose it is desired to
1323: count occurrences of both \fIshe\fR and \fIhe\fR in an input text.
1324: Some Lex rules to do this might be
1325: .TS
1326: center;
1327: l l.
1328: she s++;
1329: he h++;
1330: \en |
1331: \&. ;
1332: .TE
1333: where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR.
1334: Remember that . does not include newline.
1335: Since \fIshe\fR includes \fIhe\fR, Lex will normally
1336: .I
1337: not
1338: .R
1339: recognize
1340: the instances of \fIhe\fR included in \fIshe\fR,
1341: since once it has passed a \fIshe\fR those characters are gone.
1342: .PP
1343: Sometimes the user would like to override this choice. The action
1344: REJECT
1345: means ``go do the next alternative.''
1346: It causes whatever rule was second choice after the current
1347: rule to be executed.
1348: The position of the input pointer is adjusted accordingly.
1349: Suppose the user really wants to count the included instances of \fIhe\fR:
1350: .TS
1351: center;
1352: l l.
1353: she {s++; REJECT;}
1354: he {h++; REJECT;}
1355: \en |
1356: \&. ;
1357: .TE
1358: these rules are one way of changing the previous example
1359: to do just that.
1360: After counting each expression, it is rejected; whenever appropriate,
1361: the other expression will then be counted. In this example, of course,
1362: the user could note that \fIshe\fR includes \fIhe\fR but not
1363: vice versa, and omit the REJECT action on \fIhe\fR;
1364: in other cases, however, it
1365: would not be possible a priori to tell
1366: which input characters
1367: were in both classes.
1368: .PP
1369: Consider the two rules
1370: .TS
1371: center;
1372: l l.
1373: a[bc]+ { ... ; REJECT;}
1374: a[cd]+ { ... ; REJECT;}
1375: .TE
1376: If the input is
1377: .I ab ,
1378: only the first rule matches,
1379: and on
1380: .I ad
1381: only the second matches.
1382: The input string
1383: .I accb
1384: matches the first rule for four characters
1385: and then the second rule for three characters.
1386: In contrast, the input
1387: .I accd
1388: agrees with
1389: the second rule for four characters and then the first
1390: rule for three.
1391: .PP
1392: In general, REJECT is useful whenever
1393: the purpose of Lex is not to partition the input
1394: stream but to detect all examples of some items
1395: in the input, and the instances of these items
1396: may overlap or include each other.
1397: Suppose a digram table of the input is desired;
1398: normally the digrams overlap, that is the word
1399: .I the
1400: is considered to contain
1401: both
1402: .I th
1403: and
1404: .I he .
1405: Assuming a two-dimensional array named
1406: .ul
1407: digram
1408: to be incremented, the appropriate
1409: source is
1410: .TS
1411: center;
1412: l l.
1413: %%
1414: [a\-z][a\-z] {
1415: digram[yytext[0]][yytext[1]]++;
1416: REJECT;
1417: }
1418: \. ;
1419: \en ;
1420: .TE
1421: where the REJECT is necessary to pick up
1422: a letter pair beginning at every character, rather than at every
1423: other character.
1424: .2C
1425: .NH
1426: Lex Source Definitions.
1427: .PP
1428: Remember the format of the Lex
1429: source:
1430: .TS
1431: center;
1432: l.
1433: {definitions}
1434: %%
1435: {rules}
1436: %%
1437: {user routines}
1438: .TE
1439: So far only the rules have been described. The user needs
1440: additional options,
1441: though, to define variables for use in his program and for use
1442: by Lex.
1443: These can go either in the definitions section
1444: or in the rules section.
1445: .PP
1446: Remember that Lex is turning the rules into a program.
1447: Any source not intercepted by Lex is copied
1448: into the generated program. There are three classes
1449: of such things.
1450: .IP 1)
1451: Any line which is not part of a Lex rule or action
1452: which begins with a blank or tab is copied into
1453: the Lex generated program.
1454: Such source input prior to the first %% delimiter will be external
1455: to any function in the code; if it appears immediately after the first
1456: %%,
1457: it appears in an appropriate place for declarations
1458: in the function written by Lex which contains the actions.
1459: This material must look like program fragments,
1460: and should precede the first Lex rule.
1461: .IP
1462: As a side effect of the above, lines which begin with a blank
1463: or tab, and which contain a comment,
1464: are passed through to the generated program.
1465: This can be used to include comments in either the Lex source or
1466: the generated code. The comments should follow the host
1467: language convention.
1468: .IP 2)
1469: Anything included between lines containing
1470: only
1471: .I %{
1472: and
1473: .I %}
1474: is
1475: copied out as above. The delimiters are discarded.
1476: This format permits entering text like preprocessor statements that
1477: must begin in column 1,
1478: or copying lines that do not look like programs.
1479: .IP 3)
1480: Anything after the third %% delimiter, regardless of formats, etc.,
1481: is copied out after the Lex output.
1482: .PP
1483: Definitions intended for Lex are given
1484: before the first %% delimiter. Any line in this section
1485: not contained between %{ and %}, and begining
1486: in column 1, is assumed to define Lex substitution strings.
1487: The format of such lines is
1488: .TS
1489: center;
1490: l l.
1491: name translation
1492: .TE
1493: and it
1494: causes the string given as a translation to
1495: be associated with the name.
1496: The name and translation
1497: must be separated by at least one blank or tab, and the name must begin with a letter.
1498: The translation can then be called out
1499: by the {name} syntax in a rule.
1500: Using {D} for the digits and {E} for an exponent field,
1501: for example, might abbreviate rules to recognize numbers:
1502: .TS
1503: center;
1504: l l.
1505: D [0\-9]
1506: E [DEde][\-+]?{D}+
1507: %%
1508: {D}+ printf("integer");
1509: {D}+"."{D}\(**({E})? |
1510: {D}\(**"."{D}+({E})? |
1511: {D}+{E} printf("real");
1512: .TE
1513: Note the first two rules for real numbers;
1514: both require a decimal point and contain
1515: an optional exponent field,
1516: but the first requires at least one digit before the
1517: decimal point and the second requires at least one
1518: digit after the decimal point.
1519: To correctly handle the problem
1520: posed by a Fortran expression such as
1521: .I 35.EQ.I ,
1522: which does not contain a real number, a context-sensitive
1523: rule such as
1524: .TS
1525: center;
1526: l l.
1527: [0\-9]+/"."EQ printf("integer");
1528: .TE
1529: could be used in addition to the normal rule for integers.
1530: .PP
1531: The definitions
1532: section may also contain other commands, including the
1533: selection of a host language, a character set table,
1534: a list of start conditions, or adjustments to the default
1535: size of arrays within Lex itself for larger source programs.
1536: These possibilities
1537: are discussed below under ``Summary of Source Format,''
1538: section 12.
1539: .2C
1540: .NH
1541: Usage.
1542: .PP
1543: There are two steps in
1544: compiling a Lex source program.
1545: First, the Lex source must be turned into a generated program
1546: in the host general purpose language.
1547: Then this program must be compiled and loaded, usually with
1548: a library of Lex subroutines.
1549: The generated program
1550: is on a file named
1551: .I lex.yy.c .
1552: The I/O library is defined in terms of the C standard
1553: library [6].
1554: .PP
1555: The C programs generated by Lex are slightly different
1556: on OS/370, because the
1557: OS compiler is less powerful than the UNIX or GCOS compilers,
1558: and does less at compile time.
1559: C programs generated on GCOS and UNIX are the same.
1560: .PP
1561: .I
1562: UNIX.
1563: .R
1564: The library is accessed by the loader flag
1565: .I \-ll .
1566: So an appropriate
1567: set of commands is
1568: .KS
1569: .in 5
1570: lex source
1571: cc lex.yy.c \-ll
1572: .in 0
1573: .KE
1574: The resulting program is placed on the usual file
1575: .I
1576: a.out
1577: .R
1578: for later execution.
1579: To use Lex with Yacc see below.
1580: Although the default Lex I/O routines use the C standard library,
1581: the Lex automata themselves do not do so;
1582: if private versions of
1583: .I
1584: input,
1585: output
1586: .R
1587: and
1588: .I unput
1589: are given, the library can be avoided.
1590: .PP
1591: .2C
1592: .NH
1593: Lex and Yacc.
1594: .PP
1595: If you want to use Lex with Yacc, note that what Lex writes is a program
1596: named
1597: .I
1598: yylex(),
1599: .R
1600: the name required by Yacc for its analyzer.
1601: Normally, the default main program on the Lex library
1602: calls this routine, but if Yacc is loaded, and its main
1603: program is used, Yacc will call
1604: .I
1605: yylex().
1606: .R
1607: In this case each Lex rule should end with
1608: .TS
1609: center;
1610: l.
1611: return(token);
1612: .TE
1613: where the appropriate token value is returned.
1614: An easy way to get access
1615: to Yacc's names for tokens is to
1616: compile the Lex output file as part of
1617: the Yacc output file by placing the line
1618: .TS
1619: center;
1620: l.
1621: # include "lex.yy.c"
1622: .TE
1623: in the last section of Yacc input.
1624: Supposing the grammar to be
1625: named ``good'' and the lexical rules to be named ``better''
1626: the UNIX command sequence can just be:
1627: .TS
1628: center;
1629: l.
1630: yacc good
1631: lex better
1632: cc y.tab.c \-ly \-ll
1633: .TE
1634: The Yacc library (\-ly) should be loaded before the Lex library,
1635: to obtain a main program which invokes the Yacc parser.
1636: The generations of Lex and Yacc programs can be done in
1637: either order.
1638: .2C
1639: .NH
1640: Examples.
1641: .PP
1642: As a trivial problem, consider copying an input file while
1643: adding 3 to every positive number divisible by 7.
1644: Here is a suitable Lex source program
1645: .TS
1646: center;
1647: l l.
1648: %%
1649: int k;
1650: [0\-9]+ {
1651: k = atoi(yytext);
1652: if (k%7 == 0)
1653: printf("%d", k+3);
1654: else
1655: printf("%d",k);
1656: }
1657: .TE
1658: to do just that.
1659: The rule [0\-9]+ recognizes strings of digits;
1660: .I
1661: atoi
1662: .R
1663: converts the digits to binary
1664: and stores the result in
1665: .ul
1666: k.
1667: The operator % (remainder) is used to check whether
1668: .ul
1669: k
1670: is divisible by 7; if it is,
1671: it is incremented by 3 as it is written out.
1672: It may be objected that this program will alter such
1673: input items as
1674: .I 49.63
1675: or
1676: .I X7 .
1677: Furthermore, it increments the absolute value
1678: of all negative numbers divisible by 7.
1679: To avoid this, just add a few more rules after the active one,
1680: as here:
1681: .TS
1682: center;
1683: l l.
1684: %%
1685: int k;
1686: \-?[0\-9]+ {
1687: k = atoi(yytext);
1688: printf("%d",
1689: k%7 == 0 ? k+3 : k);
1690: }
1691: \-?[0\-9.]+ ECHO;
1692: [A-Za-z][A-Za-z0-9]+ ECHO;
1693: .TE
1694: Numerical strings containing
1695: a ``.'' or preceded by a letter will be picked up by
1696: one of the last two rules, and not changed.
1697: The
1698: .I if\-else
1699: has been replaced by
1700: a C conditional expression to save space;
1701: the form
1702: .ul
1703: a?b:c
1704: means ``if
1705: .I a
1706: then
1707: .I b
1708: else
1709: .I c ''.
1710: .PP
1711: For an example of statistics gathering, here
1712: is a program which histograms the lengths
1713: of words, where a word is defined as a string of letters.
1714: .TS
1715: center;
1716: l l.
1717: int lengs[100];
1718: %%
1719: [a\-z]+ lengs[yyleng]++;
1720: \&. |
1721: \en ;
1722: %%
1723: .T&
1724: l s.
1725: yywrap()
1726: {
1727: int i;
1728: printf("Length No. words\en");
1729: for(i=0; i<100; i++)
1730: if (lengs[i] > 0)
1731: printf("%5d%10d\en",i,lengs[i]);
1732: return(1);
1733: }
1734: .TE
1735: This program
1736: accumulates the histogram, while producing no output. At the end
1737: of the input it prints the table.
1738: The final statement
1739: .I
1740: return(1);
1741: .R
1742: indicates that Lex is to perform wrapup. If
1743: .I
1744: yywrap
1745: .R
1746: returns zero (false)
1747: it implies that further input is available
1748: and the program is
1749: to continue reading and processing.
1750: To provide a
1751: .I
1752: yywrap
1753: .R
1754: that never
1755: returns true causes an infinite loop.
1756: .PP
1757: As a larger example,
1758: here are some parts of a program written by N. L. Schryer
1759: to convert double precision Fortran to single precision Fortran.
1760: Because Fortran does not distinguish upper and lower case letters,
1761: this routine begins by defining a set of classes including
1762: both cases of each letter:
1763: .TS
1764: center;
1765: l l.
1766: a [aA]
1767: b [bB]
1768: c [cC]
1769: \&...
1770: z [zZ]
1771: .TE
1772: An additional class recognizes white space:
1773: .TS
1774: center;
1775: l l.
1776: W [ \et]\(**
1777: .TE
1778: The first rule changes
1779: ``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.
1780: .TS
1781: center;
1782: l.
1783: {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
1784: printf(yytext[0]==\(fmd\(fm? "real" : "REAL");
1785: }
1786: .TE
1787: Care is taken throughout this program to preserve the case
1788: (upper or lower)
1789: of the original program.
1790: The conditional operator is used to
1791: select the proper form of the keyword.
1792: The next rule copies continuation card indications to
1793: avoid confusing them with constants:
1794: .TS
1795: center;
1796: l l.
1797: ^" "[^ 0] ECHO;
1798: .TE
1799: In the regular expression, the quotes surround the
1800: blanks.
1801: It is interpreted as
1802: ``beginning of line, then five blanks, then
1803: anything but blank or zero.''
1804: Note the two different meanings of
1805: .I ^ .
1806: There follow some rules to change double precision
1807: constants to ordinary floating constants.
1808: .TS
1809: center;
1810: l.
1811: [0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ |
1812: [0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+ |
1813: "."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ {
1814: /\(** convert constants \(**/
1815: for(p=yytext; \(**p != 0; p++)
1816: {
1817: if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm)
1818: \(**p=+ \(fme\(fm\- \(fmd\(fm;
1819: ECHO;
1820: }
1821: .TE
1822: After the floating point constant is recognized, it is
1823: scanned by the
1824: .ul
1825: for
1826: loop
1827: to find the letter
1828: .I d
1829: or
1830: .I D .
1831: The program than adds
1832: .c
1833: .I \(fme\(fm\-\(fmd\(fm ,
1834: which converts
1835: it to the next letter of the alphabet.
1836: The modified constant, now single-precision,
1837: is written out again.
1838: There follow a series of names which must be respelled to remove
1839: their initial \fId\fR.
1840: By using the
1841: array
1842: .I
1843: yytext
1844: .R
1845: the same action suffices for all the names (only a sample of
1846: a rather long list is given here).
1847: .TS
1848: center;
1849: l l.
1850: {d}{s}{i}{n} |
1851: {d}{c}{o}{s} |
1852: {d}{s}{q}{r}{t} |
1853: {d}{a}{t}{a}{n} |
1854: \&...
1855: {d}{f}{l}{o}{a}{t} printf("%s",yytext+1);
1856: .TE
1857: Another list of names must have initial \fId\fR changed to initial \fIa\fR:
1858: .TS
1859: center;
1860: l l.
1861: {d}{l}{o}{g} |
1862: {d}{l}{o}{g}10 |
1863: {d}{m}{i}{n}1 |
1864: {d}{m}{a}{x}1 {
1865: yytext[0] =+ \(fma\(fm \- \(fmd\(fm;
1866: ECHO;
1867: }
1868: .TE
1869: And one routine
1870: must have initial \fId\fR changed to initial \fIr\fR:
1871: .TS
1872: center;
1873: l l.
1874: {d}1{m}{a}{c}{h} {yytext[0] =+ \(fmr\(fm \- \(fmd\(fm;
1875: ECHO;
1876: }
1877: .TE
1878: To avoid such names as \fIdsinx\fR being detected as instances
1879: of \fIdsin\fR, some final rules pick up longer words as identifiers
1880: and copy some surviving characters:
1881: .TS
1882: center;
1883: l l.
1884: [A\-Za\-z][A\-Za\-z0\-9]\(** |
1885: [0\-9]+ |
1886: \en |
1887: \&. ECHO;
1888: .TE
1889: Note that this program is not complete; it
1890: does not deal with the spacing problems in Fortran or
1891: with the use of keywords as identifiers.
1892: .br
1893: .2C
1894: .NH
1895: Left Context Sensitivity.
1896: .PP
1897: Sometimes
1898: it is desirable to have several sets of lexical rules
1899: to be applied at different times in the input.
1900: For example, a compiler preprocessor might distinguish
1901: preprocessor statements and analyze them differently
1902: from ordinary statements.
1903: This requires
1904: sensitivity
1905: to prior context, and there are several ways of handling
1906: such problems.
1907: The \fI^\fR operator, for example, is a prior context operator,
1908: recognizing immediately preceding left context just as \fI$\fR recognizes
1909: immediately following right context.
1910: Adjacent left context could be extended, to produce a facility similar to
1911: that for adjacent right context, but it is unlikely
1912: to be as useful, since often the relevant left context
1913: appeared some time earlier, such as at the beginning of a line.
1914: .PP
1915: This section describes three means of dealing
1916: with different environments: a simple use of flags,
1917: when only a few rules change from one environment to another,
1918: the use of
1919: .I
1920: start conditions
1921: .R
1922: on rules,
1923: and the possibility of making multiple lexical analyzers all run
1924: together.
1925: In each case, there are rules which recognize the need to change the
1926: environment in which the
1927: following input text is analyzed, and set some parameter
1928: to reflect the change. This may be a flag explicitly tested by
1929: the user's action code; such a flag is the simplest way of dealing
1930: with the problem, since Lex is not involved at all.
1931: It may be more convenient,
1932: however,
1933: to have Lex remember the flags as initial conditions on the rules.
1934: Any rule may be associated with a start condition. It will only
1935: be recognized when Lex is in
1936: that start condition.
1937: The current start condition may be changed at any time.
1938: Finally, if the sets of rules for the different environments
1939: are very dissimilar,
1940: clarity may be best achieved by writing several distinct lexical
1941: analyzers, and switching from one to another as desired.
1942: .PP
1943: Consider the following problem: copy the input to the output,
1944: changing the word \fImagic\fR to \fIfirst\fR on every line which began
1945: with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line
1946: which began with the letter \fIb\fR, and changing
1947: \fImagic\fR to \fIthird\fR on every line which began
1948: with the letter \fIc\fR. All other words and all other lines
1949: are left unchanged.
1950: .PP
1951: These rules are so simple that the easiest way
1952: to do this job is with a flag:
1953: .TS
1954: center;
1955: l l.
1956: int flag;
1957: %%
1958: ^a {flag = \(fma\(fm; ECHO;}
1959: ^b {flag = \(fmb\(fm; ECHO;}
1960: ^c {flag = \(fmc\(fm; ECHO;}
1961: \en {flag = 0 ; ECHO;}
1962: magic {
1963: switch (flag)
1964: {
1965: case \(fma\(fm: printf("first"); break;
1966: case \(fmb\(fm: printf("second"); break;
1967: case \(fmc\(fm: printf("third"); break;
1968: default: ECHO; break;
1969: }
1970: }
1971: .TE
1972: should be adequate.
1973: .PP
1974: To handle the same problem with start conditions, each
1975: start condition must be introduced to Lex in the definitions section
1976: with a line reading
1977: .TS
1978: center;
1979: l l.
1980: %Start name1 name2 ...
1981: .TE
1982: where the conditions may be named in any order.
1983: The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR.
1984: The conditions may be referenced at the
1985: head of a rule with the <> brackets:
1986: .TS
1987: center;
1988: l.
1989: <name1>expression
1990: .TE
1991: is a rule which is only recognized when Lex is in the
1992: start condition \fIname1\fR.
1993: To enter a start condition,
1994: execute the action statement
1995: .TS
1996: center;
1997: l.
1998: BEGIN name1;
1999: .TE
2000: which changes the start condition to \fIname1\fR.
2001: To resume the normal state,
2002: .TS
2003: center;
2004: l.
2005: BEGIN 0;
2006: .TE
2007: resets the initial condition
2008: of the Lex automaton interpreter.
2009: A rule may be active in several
2010: start conditions:
2011: .TS
2012: center;
2013: l.
2014: <name1,name2,name3>
2015: .TE
2016: is a legal prefix. Any rule not beginning with the
2017: <> prefix operator is always active.
2018: .PP
2019: The same example as before can be written:
2020: .TS
2021: center;
2022: l l.
2023: %START AA BB CC
2024: %%
2025: ^a {ECHO; BEGIN AA;}
2026: ^b {ECHO; BEGIN BB;}
2027: ^c {ECHO; BEGIN CC;}
2028: \en {ECHO; BEGIN 0;}
2029: <AA>magic printf("first");
2030: <BB>magic printf("second");
2031: <CC>magic printf("third");
2032: .TE
2033: where the logic is exactly the same as in the previous
2034: method of handling the problem, but Lex does the work
2035: rather than the user's code.
2036: .2C
2037: .NH
2038: Character Set.
2039: .PP
2040: The programs generated by Lex handle
2041: character I/O only through the routines
2042: .I
2043: input,
2044: output,
2045: .R
2046: and
2047: .I
2048: unput.
2049: .R
2050: Thus the character representation
2051: provided in these routines
2052: is accepted by Lex and employed to return
2053: values in
2054: .I
2055: yytext.
2056: .R
2057: For internal use
2058: a character is represented as a small integer
2059: which, if the standard library is used,
2060: has a value equal to the integer value of the bit
2061: pattern representing the character on the host computer.
2062: Normally, the letter
2063: .I a
2064: is represented as the same form as the character constant
2065: .I \(fma\(fm .
2066: If this interpretation is changed, by providing I/O
2067: routines which translate the characters,
2068: Lex must be told about
2069: it, by giving a translation table.
2070: This table must be in the definitions section,
2071: and must be bracketed by lines containing only
2072: ``%T''.
2073: The table contains lines of the form
2074: .TS
2075: center;
2076: l.
2077: {integer} {character string}
2078: .TE
2079: which indicate the value associated with each character.
2080: Thus the next example
2081: .GS 2
2082: .TS
2083: center;
2084: l l.
2085: %T
2086: 1 Aa
2087: 2 Bb
2088: \&...
2089: 26 Zz
2090: 27 \en
2091: 28 +
2092: 29 \-
2093: 30 0
2094: 31 1
2095: \&...
2096: 39 9
2097: %T
2098: .TE
2099: .sp
2100: .ce 1
2101: Sample character table.
2102: .GE
2103: maps the lower and upper case letters together into the integers 1 through 26,
2104: newline into 27, + and \- into 28 and 29, and the
2105: digits into 30 through 39.
2106: Note the escape for newline.
2107: If a table is supplied, every character that is to appear either
2108: in the rules or in any valid input must be included
2109: in the table.
2110: No character
2111: may be assigned the number 0, and no character may be
2112: assigned a bigger number than the size of the hardware character set.
2113: .2C
2114: .NH
2115: Summary of Source Format.
2116: .PP
2117: The general form of a Lex source file is:
2118: .TS
2119: center;
2120: l.
2121: {definitions}
2122: %%
2123: {rules}
2124: %%
2125: {user subroutines}
2126: .TE
2127: The definitions section contains
2128: a combination of
2129: .IP 1)
2130: Definitions, in the form ``name space translation''.
2131: .IP 2)
2132: Included code, in the form ``space code''.
2133: .IP 3)
2134: Included code, in the form
2135: .TS
2136: center;
2137: l.
2138: %{
2139: code
2140: %}
2141: .TE
2142: .ns
2143: .IP 4)
2144: Start conditions, given in the form
2145: .TS
2146: center;
2147: l.
2148: %S name1 name2 ...
2149: .TE
2150: .ns
2151: .IP 5)
2152: Character set tables, in the form
2153: .TS
2154: center;
2155: l.
2156: %T
2157: number space character-string
2158: \&...
2159: %T
2160: .TE
2161: .ns
2162: .IP 6)
2163: Changes to internal array sizes, in the form
2164: .TS
2165: center;
2166: l.
2167: %\fIx\fR\0\0\fInnn\fR
2168: .TE
2169: where \fInnn\fR is a decimal integer representing an array size
2170: and \fIx\fR selects the parameter as follows:
2171: .TS
2172: center;
2173: c c
2174: c l.
2175: Letter Parameter
2176: p positions
2177: n states
2178: e tree nodes
2179: a transitions
2180: k packed character classes
2181: o output array size
2182: .TE
2183: .LP
2184: Lines in the rules section have the form ``expression action''
2185: where the action may be continued on succeeding
2186: lines by using braces to delimit it.
2187: .PP
2188: Regular expressions in Lex use the following
2189: operators:
2190: .br
2191: .TS
2192: center;
2193: l l.
2194: x the character "x"
2195: "x" an "x", even if x is an operator.
2196: \ex an "x", even if x is an operator.
2197: [xy] the character x or y.
2198: [x\-z] the characters x, y or z.
2199: [^x] any character but x.
2200: \&. any character but newline.
2201: ^x an x at the beginning of a line.
2202: <y>x an x when Lex is in start condition y.
2203: x$ an x at the end of a line.
2204: x? an optional x.
2205: x\(** 0,1,2, ... instances of x.
2206: x+ 1,2,3, ... instances of x.
2207: x|y an x or a y.
2208: (x) an x.
2209: x/y an x but only if followed by y.
2210: {xx} the translation of xx from the
2211: definitions section.
2212: x{m,n} \fIm\fR through \fIn\fR occurrences of x
2213: .TE
2214: .NH
2215: Caveats and Bugs.
2216: .PP
2217: There are pathological expressions which
2218: produce exponential growth of the tables when
2219: converted to deterministic machines;
2220: fortunately, they are rare.
2221: .PP
2222: REJECT does not rescan the input; instead it remembers the results of the previous
2223: scan. This means that if a rule with trailing context is found, and
2224: REJECT executed, the user
2225: must not have used
2226: .ul
2227: unput
2228: to change the characters forthcoming
2229: from the input stream.
2230: This is the only restriction on the user's ability to manipulate
2231: the not-yet-processed input.
2232: .PP
2233: .2C
2234: .NH
2235: Acknowledgments.
2236: .PP
2237: As should
2238: be obvious from the above, the outside of Lex
2239: is patterned
2240: on Yacc and the inside on Aho's string matching routines.
2241: Therefore, both S. C. Johnson and A. V. Aho
2242: are really originators
2243: of much of Lex,
2244: as well as debuggers of it.
2245: Many thanks are due to both.
2246: .PP
2247: The code of the current version of Lex was designed, written,
2248: and debugged by Eric Schmidt.
2249: .SG MH-1274-MEL-unix
2250: .sp 1
2251: .2C
2252: .NH
2253: References.
2254: .SP 1v
2255: .IP 1.
2256: B. W. Kernighan and D. M. Ritchie,
2257: .I
2258: The C Programming Language,
2259: .R
2260: Prentice-Hall, N. J. (1978).
2261: .IP 2.
2262: B. W. Kernighan,
2263: .I
2264: Ratfor: A Preprocessor for a Rational Fortran,
2265: .R
2266: Software \- Practice and Experience,
2267: \fB5\fR, pp. 395-496 (1975).
2268: .IP 3.
2269: S. C. Johnson,
2270: .I
2271: Yacc: Yet Another Compiler Compiler,
2272: .R
2273: Computing Science Technical Report No. 32,
2274: 1975,
2275: .MH
2276: .if \n(tm (also TM 75-1273-6)
2277: .IP 4.
2278: A. V. Aho and M. J. Corasick,
2279: .I
2280: Efficient String Matching: An Aid to Bibliographic Search,
2281: .R
2282: Comm. ACM
2283: .B
2284: 18,
2285: .R
2286: 333-340 (1975).
2287: .IP 5.
2288: B. W. Kernighan, D. M. Ritchie and K. L. Thompson,
2289: .I
2290: QED Text Editor,
2291: .R
2292: Computing Science Technical Report No. 5,
2293: 1972,
2294: .MH
2295: .IP 6.
2296: D. M. Ritchie,
2297: private communication.
2298: See also
2299: M. E. Lesk,
2300: .I
2301: The Portable C Library,
2302: .R
2303: Computing Science Technical Report No. 31,
2304: .MH
2305: .if \n(tm (also TM 75-1274-11)
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.