|
|
1.1 root 1: A lexical analyzer generator for Standard ML.
2:
3:
4: Andrew W. Appel
5: James S. Mattson
6: David R. Tarditi
7:
8: Princeton University
9:
10: Version 1.2, October 1989
11:
12: Copyright (c) 1989 by Andrew W. Appel, James S. Mattson, David R. Tarditi
13:
14: This software comes with ABSOLUTELY NO WARRANTY.
15: This software is subject only to the PRINCETON STANDARD ML SOFTWARE LIBRARY
16: COPYRIGHT NOTICE, LICENSE AND DISCLAIMER, (in the file "COPYRIGHT",
17: distributed with this software). You may copy and distribute this software;
18: see the COPYRIGHT NOTICE for details and restrictions.
19:
20: I. General Description
21:
22: Computer programs often need to divide their input into words and
23: distinguish between different kinds of words. Compilers, for
24: example, need to distinguish between integers, reserved words, and
25: identifiers. Applications programs often need to be able to
26: recognize components of typed commands from users.
27:
28: The problem of segmenting input into words and recognizing classes of
29: words is known as lexical analysis. Small cases of this problem,
30: such as reading text strings separated by spaces, can be solved by
31: using hand-written programs. Larger cases of this problem, such as
32: tokenizing an input stream for a compiler, can also be solved using
33: hand-written programs.
34:
35: A hand-written program for a large lexical analysis problem, however,
36: suffers from two major problems. First, the program requires a fair
37: amount of programmer time to create. Second, the description of
38: classes of words is not explicit in the program. It must be inferred
39: from the program code. This makes it difficult to verify if the
40: program recognizes the correct words for each class. It also makes
41: future maintenance of the program difficult.
42:
43: Lex, a programming tool for the Unix system, is a successful solution
44: to the general problem of lexical analysis. It uses regular
45: expressions to describe classes of words. A program fragment is
46: associated with each class of words. This information is given to
47: Lex as a specification (a Lex program). Lex produces a program for a
48: function that can be used to perform lexical analysis.
49:
50: The function operates as follows. It finds the longest word starting
51: from the current position in the input stream that is in one of the
52: word classes. It executes the program fragment associated with the
53: class, and sets the current position in the input stream to be the
54: character after the word. The program fragment has the actual text
55: of the word available to it, and may be any piece of code. For many
56: applications it returns some kind of value.
57:
58: Lex allows the programmer to make the language description explicit,
59: and to concentrate on what to do with the recognized words, not how
60: to recognize the words. It saves programmer time and increases
61: program maintainability.
62:
63: Unfortunately, Lex is targeted only C. It also places artificial
64: limits on the size of strings that can be recognized.
65:
66: ML-Lex is a variant of Lex for the ML programming language. ML-Lex
67: has a syntax similar to Lex, and produces an ML program instead of a
68: C program. ML-Lex produces a program that runs very efficiently.
69: Typically the program will be as fast or even faster than a
70: hand-coded lexer implemented in Standard ML.
71:
72: The program typically uses only a small amount of space.
73: ML-Lex thus allows ML programmers the same benefits that Lex allows C
74: programmers. It also does not place artificial limits on the size of
75: recognized strings.
76:
77: II. ML-Lex specifications
78:
79: An ML-Lex specification has the general format:
80:
81: {user declarations}
82: %%
83: {ML-Lex definitions}
84: %%
85: {rules}
86:
87: Each section is separated from the others by a '%%' delimiter.
88:
89: The rules are used to define the lexical analysis function. Each
90: rule has two parts - a regular expression and an action. The regular
91: expression defines the word class that a rule matches. The action is
92: a program fragment to be executed when a rule matches the input. The
93: actions are used to compute values, and must all return values of the
94: same type.
95:
96: The user can define values available to all rule actions in the user
97: declarations section. The user must define two values in this
98: section - a type lexresult and a function eof. Lexresult defines the
99: type of values returned by the rule actions. The function "eof" is
100: called by the lexer when the end of the input stream is reached. It
101: will typically return a value signalling eof or raise an exception.
102: It is called with the same argument as lex (see %arg, below),
103: and must return a value of type lexresult.
104:
105: In the definitions section, the user can define named regular
106: expressions, a set of start states, and specify which of the various
107: bells and whistles of ML-Lex are desired.
108:
109: The start states allow the user to control when certain rules are
110: matched. Rules may be defined to match only when the lexer is in
111: specific start states. The user may change the lexer's start state
112: in a rule action. This allows the user to specify special handling
113: of lexical objects.
114:
115: This feature is typically used to handle quoted strings with escapes
116: to denote special characters. The rules to recognize the inside
117: contents of a string are defined for only one start state. This
118: start state is entered when the beginning of a string is recognized,
119: and exited when the end of the string is recognized.
120:
121: III. Regular expressions.
122:
123: Regular expressions are a simple language for denoting classes of
124: strings. A regular expression is defined inductively over an
125: alphabet with a set of basic operations. The alphabet for ML-Lex is
126: the Ascii character set (character codes 0-127; or if %full is used, 0-255).
127:
128: The syntax and semantics of regular expressions will be described in
129: order of decreasing precedence (from the most tightly-binding operators
130: to the most weakly-binding):
131:
132: An individual character stands for itself, except for the
133: reserved characters ? * + | ( ) ^ $ / ; . = < > [ { " \
134:
135: A backslash followed by one of the reserved characters stands
136: for that character.
137:
138: A set of characters enclosed in square brackets [ ] stands
139: for any one of those characters. Inside the brackets, only
140: the symbols \ - ^ are reserved. An initial up-arrow ^ stands
141: for the complement of the characters listed, e.g. [^abc]
142: stands any character except a, b, or c. The hyphen - denotes
143: a range of characters, e.g. [a-z] stands for any lower-case
144: alphabetic character, and [0-9a-fA-F] stands for any hexadecimal
145: digit. To include ^ literally in a bracketed set, put it anywhere
146: but first; to include - literally in a set, put it first or last.
147:
148: The dot . character stands for any character except newline,
149: i.e. the same as [^\n]
150:
151: The following special escape sequences are available, inside
152: or outside of square-brackets:
153: \b - backspace
154: \n - newline
155: \t - tab
156: \ddd - where ddd is a 3 digit decimal escape.
157:
158: A sequence of characters will stand for itself (reserved
159: characters will be taken literally) if it is enclosed in
160: double quotes " ".
161:
162: A named regular expression (defined in the "definitions"
163: section) may be referred to by enclosing its name in
164: braces { }.
165:
166: Any regular expression may be enclosed in parentheses ( )
167: for syntactic (but, as usual, not semantic) effect.
168:
169: The postfix operator * stands for Kleene closure:
170: zero or more repetitions of the preceding expression.
171:
172: The postfix operator + stands for one or more repetitions
173: of the preceding expression.
174:
175: The postfix operator ? stands for zero or one occurrence of
176: the preceding expression.
177:
178: A postfix repetition range {n1,n2} where n1 and n2 are small
179: integers stands for any number of repetitions between n1 and n2
180: of the preceding expression. The notation {n1} stands for
181: exactly n1 repetitions.
182:
183: Concatenation of expressions denotes concatenation of strings.
184: The expression e1 e2 stands for any string that results from
185: the concatenation of one string that matches e1 with another
186: string that matches e2.
187:
188: The infix operator | stands for alternation. The expression
189: e1 | e2 stands for anything that either e1 or e2 stands for.
190:
191: The infix operator / denotes lookahead. The expression e1 / e2
192: matches any string that e1 stands for, but only when that
193: string is followed by a string that matches e2.
194:
195: When the up-arrow ^ occurs at the beginning of an expression,
196: that expression will only match strings that occur at the
197: beginning of a line (right after a newline character).
198: When the dollar sign $ occurs at the end of an expression,
199: that expression will only match strings that occur at the
200: end of a line (right before a newline character).
201:
202: Here are some examples of regular expressions, and descriptions of the
203: set of strings they denote:
204:
205: 0 | 1 | 2 | 3 A single digit between 0 and 3
206: [0123] A single digit between 0 and 3
207: 0123 The string "0123"
208: 0* All strings of 0 or more 0's
209: 00* All strings of 1 or more 0's
210: 0+ All strings of 1 or more 0's
211: [0-9]{3} Any three-digit decimal number.
212: \\[ntb] The strings "\n" "\t" "\b"
213: (00)* Any string with an even number of 0's.
214:
215: IV. ML-Lex syntax summary
216:
217: A. User declarations
218:
219: Anything up to the first %% is in the user declarations section. The
220: user should note that no symbolic identifier containing '%%' can be
221: used in this section.
222:
223: B. ML-Lex definitions
224:
225: Start states can be defined with
226:
227: %s {identifier list} ;
228:
229: or %S {identifier list} ;
230:
231: An identifier list consists of one or more identifiers.
232:
233: An identifier consists of one or more letters, digits, underscores,
234: or primes. It must begin with a letter.
235:
236: Named expressions can be defined with
237:
238: {identifier} = {regular expression} ;
239:
240: Regular expressions are defined below.
241:
242: The following % commands are also available:
243:
244: %reject - create REJECT() function
245: %count - count newlines using yylineno
246: %full - create lexer for the full 8-bit character set,
247: with characters in the range 0-255 permitted
248: as input.
249: %structure {identifier} - name the structure in the output program
250: {identifier} instead of Mlex
251: %header - use code following it to create header for lexer
252: structure
253: %arg - extra (curried) formal parameter argument to be
254: passed to the lex functions, and to be passed
255: to the eof function in place of ()
256: These functions are discussed below, under values available to
257: actions.
258:
259: C. Rules
260:
261: Each rule has the format:
262:
263: <start state list> {regular expression} => ( ... code ... );
264:
265: All parentheses in ... code ... must be balanced, including those
266: used in strings and comments.
267:
268: The start state list is optional. It consists of a list of
269: identifiers separated by commas, and is delimited by triangle
270: brackets < >. Each identifier must be a start state defined in the
271: %s section above.
272:
273: The regular expression is only recognized when the lexer is in one of
274: the start states in the start state list. If no start state list is
275: given, the expression is recognized in all start states.
276:
277: The lexer begins in a pre-defined start state called INITIAL.
278:
279: The lexer resolves conflicts among rules by choosing the rule with
280: the longest match, and in the case two rules match the same string,
281: choosing the rule listed first in the specification.
282:
283: The rules should match all possible input. If some input occurs that
284: does not match any rule, the lexer created by ML-Lex will raise an
285: exception LexError. Note that this differs from C Lex, which prints
286: any unmatched input on the standard output.
287:
288: V. Values available inside the code associated with a rule.
289:
290: Mlex places the value of the string matched by a regular expression
291: in yytext. Yytext has type string. The user may also recursively
292: call the lexing function with lex().
293:
294: This is convenient for ignoring white space or comments silently:
295:
296: [\ \t\n]+ => ( lex());
297:
298: To switch start states, the user may call YYBEGIN with the name of a
299: start state.
300:
301: The following values will be available only if the corresponding %
302: command is in the ML-Lex definitions sections:
303:
304: value %command description
305: ----- -------- -----------
306: REJECT %reject REJECT() causes the current
307: rule to be "rejected."
308: The lexer behaves as if the
309: current rule had not matched;
310: another rule that matches this
311: string, or that matches the longest
312: possible prefix of this string,
313: is used instead.
314:
315: yylineno %count Current line number
316:
317:
318: These values should be used only if necessary. Adding REJECT to a
319: lexer will slow it down by 20%; adding yylineno will slow it down by
320: another 20%, or more. (It is much more efficient to recognize \n and
321: have an action that increments the line-number variable.) The use of
322: the lookahead operator / will also slow down the entire lexer.
323:
324: VI. Running ML-Lex
325:
326: Use "lexgen.sml"; this will create a structure LexGen. The function
327: LexGen.lexGen creates a program for a lexer from an input
328: specification. It takes a string argument -- the name of the file
329: containing the input specification. The output file name is
330: determined by appending ".sml" to the input file name.
331:
332: VII. Using the program produced by ML-Lex.
333:
334: When the output file is loaded, it will create a structure Mlex that
335: contains the function makeLexer. makeLexer takes a function from int
336: -> string and returns a lexing function.
337:
338: For example,
339:
340: val lexer = Mlex.makeLexer (input (open_in "f"))
341:
342: creates a lexer that operates on the file whose name is f.
343:
344: The function from int -> string should read a string of characters
345: from the input stream. It should return a null string to indicate
346: that the end of the stream has been reached. The integer is the
347: number of characters that the lexer wishes to read; the function may
348: return any non-zero number of characters. For example,
349: val lexer = Mlex.makeLexer (fn n => input_line std_in)
350: is appropriate for interactive streams where prompting, etc. occurs;
351: the lexer won't care that input_line might return a string of more
352: than or less than n characters.
353:
354: The lexer tries to read a large number of characters from the input
355: function at once, and it is desirable that the input function return
356: as many as possible. Reading many characters at once makes the lexer
357: more efficient. Fewer input calls and buffering operations are
358: needed, and input is more efficient in large block reads. For
359: interactive streams this is less of a concern, as the limiting factor
360: is the speed at which the user can type.
361:
362: To obtain a value, invoke the lexer by passing it a unit:
363:
364: val nextToken = lexer()
365:
366: If one wanted to restart the lexer, one would just discard "lexer"
367: and create a new lexer on the same stream with another call to
368: makeLexer. This is the best way to discard any characters buffered
369: internally by the lexer.
370:
371: All code in the user declarations section is placed inside a
372: structure UserDeclarations. To access this structure, use the path name
373: Mlex.UserDeclarations.
374:
375: If any input cannot be matched, the program will raise the exception
376: Mlex.LexError. An internal error (i.e. bug) will cause the
377: exception Internal.LexerError to be raised.
378:
379: If %structure is used, remember that the structure name will no
380: longer be Mlex, but the one specified in the command.
381:
382: VIII. Sample
383:
384: Here is a sample lexer for a calculator program:
385:
386: datatype lexresult= DIV | EOF | EOS | ID of string | LPAREN |
387: NUM of int | PLUS | PRINT | RPAREN | SUB | TIMES
388:
389: val linenum = ref 1
390: val error = fn x => output std_out (x ^ "\n")
391: val eof = fn () => EOF
392: %%
393: %structure CalcLex
394: alpha=[A-Za-z];
395: digit=[0-9];
396: ws = [\ \t];
397: %%
398: \n => (inc linenum; lex());
399: {ws}+ => (lex());
400: "/" => (DIV);
401: ";" => (EOS);
402: "(" => (LPAREN);
403: {digit}+ => (NUM (revfold (fn(a,r)=>ord(a)-ord("0")+10*r) (explode yytext) 0));
404: ")" => (RPAREN);
405: "+" => (PLUS);
406: {alpha}+ => (if yytext="print" then PRINT else ID yytext);
407: "-" => (SUB);
408: "*" => (TIMES);
409: . => (error ("calc: ignoring bad character "^yytext); lex());
410:
411:
412: Here is the parser for the calculator:
413:
414: (* Sample interactive calculator to demonstrate use of lexer produced by ML-Lex
415:
416: The original grammar was
417:
418: stmt_list -> stmt_list stmt
419: stmt -> print exp ; | exp ;
420: exp -> exp + t | exp - t | t
421: t -> t * f | t/f | f
422: f -> (exp) | id | num
423:
424: The function parse takes a stream and parses it for the calculator
425: program.
426:
427: If a syntax error occurs, parse prints an error message and calls itself
428: on the stream. On this system that has the effect of ignoring all input
429: to the end of a line.
430: *)
431:
432: structure Calc =
433: struct
434: open CalcLex
435: open UserDeclarations
436: exception Error
437: fun parse strm =
438: let
439: val say = output std_out
440: val lexer = makeLexer (fn n => input strm (min(1,max(can_input strm,n))))
441: val nexttok = ref (lexer())
442: val advance = fn () => (nexttok := lexer(); !nexttok)
443: val error = fn () => (say ("calc: syntax error on line" ^
444: (makestring(!linenum)) ^ "\n"); raise Error)
445: val lookup = fn i =>
446: if i = "ONE" then 1
447: else if i = "TWO" then 2
448: else (say ("calc: unknown identifier '" ^ i ^ "'\n"); raise Error)
449: fun STMT_LIST () =
450: case !nexttok of
451: EOF => ()
452: | _ => (STMT(); STMT_LIST())
453:
454: and STMT() =
455: (case !nexttok
456: of EOS => ()
457: | PRINT => (advance(); say ((makestring (E():int)) ^ "\n"); ())
458: | _ => (E(); ());
459: case !nexttok
460: of EOS => (advance())
461: | _ => error())
462: and E () = E' (T())
463: and E' (i : int ) =
464: case !nexttok of
465: PLUS => (advance (); E'(i+T()))
466: | SUB => (advance (); E'(i-T()))
467: | RPAREN => i
468: | EOF => i
469: | EOS => i
470: | _ => error()
471: and T () = T'(F())
472: and T' i =
473: case !nexttok of
474: PLUS => i
475: | SUB => i
476: | TIMES => (advance(); T'(i*F()))
477: | DIV => (advance (); T'(i div F()))
478: | EOF => i
479: | EOS => i
480: | RPAREN => i
481: | _ => error()
482: and F () =
483: case !nexttok of
484: ID i => (advance(); lookup i)
485: | LPAREN =>
486: let val v = (advance(); E())
487: in if !nexttok = RPAREN then (advance (); v) else error()
488: end
489: | NUM i => (advance(); i)
490: | _ => error()
491: in STMT_LIST () handle Error => parse strm
492: end
493: end
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.