|
|
1.1 ! root 1: A lexical analyzer generator for Standard ML. ! 2: ! 3: ! 4: Andrew W. Appel ! 5: James S. Mattson ! 6: David R. Tarditi ! 7: ! 8: Princeton University ! 9: ! 10: Version 1.2, October 1989 ! 11: ! 12: Copyright (c) 1989 by Andrew W. Appel, James S. Mattson, David R. Tarditi ! 13: ! 14: This software comes with ABSOLUTELY NO WARRANTY. ! 15: This software is subject only to the PRINCETON STANDARD ML SOFTWARE LIBRARY ! 16: COPYRIGHT NOTICE, LICENSE AND DISCLAIMER, (in the file "COPYRIGHT", ! 17: distributed with this software). You may copy and distribute this software; ! 18: see the COPYRIGHT NOTICE for details and restrictions. ! 19: ! 20: I. General Description ! 21: ! 22: Computer programs often need to divide their input into words and ! 23: distinguish between different kinds of words. Compilers, for ! 24: example, need to distinguish between integers, reserved words, and ! 25: identifiers. Applications programs often need to be able to ! 26: recognize components of typed commands from users. ! 27: ! 28: The problem of segmenting input into words and recognizing classes of ! 29: words is known as lexical analysis. Small cases of this problem, ! 30: such as reading text strings separated by spaces, can be solved by ! 31: using hand-written programs. Larger cases of this problem, such as ! 32: tokenizing an input stream for a compiler, can also be solved using ! 33: hand-written programs. ! 34: ! 35: A hand-written program for a large lexical analysis problem, however, ! 36: suffers from two major problems. First, the program requires a fair ! 37: amount of programmer time to create. Second, the description of ! 38: classes of words is not explicit in the program. It must be inferred ! 39: from the program code. This makes it difficult to verify if the ! 40: program recognizes the correct words for each class. It also makes ! 41: future maintenance of the program difficult. ! 42: ! 43: Lex, a programming tool for the Unix system, is a successful solution ! 44: to the general problem of lexical analysis. It uses regular ! 45: expressions to describe classes of words. A program fragment is ! 46: associated with each class of words. This information is given to ! 47: Lex as a specification (a Lex program). Lex produces a program for a ! 48: function that can be used to perform lexical analysis. ! 49: ! 50: The function operates as follows. It finds the longest word starting ! 51: from the current position in the input stream that is in one of the ! 52: word classes. It executes the program fragment associated with the ! 53: class, and sets the current position in the input stream to be the ! 54: character after the word. The program fragment has the actual text ! 55: of the word available to it, and may be any piece of code. For many ! 56: applications it returns some kind of value. ! 57: ! 58: Lex allows the programmer to make the language description explicit, ! 59: and to concentrate on what to do with the recognized words, not how ! 60: to recognize the words. It saves programmer time and increases ! 61: program maintainability. ! 62: ! 63: Unfortunately, Lex is targeted only C. It also places artificial ! 64: limits on the size of strings that can be recognized. ! 65: ! 66: ML-Lex is a variant of Lex for the ML programming language. ML-Lex ! 67: has a syntax similar to Lex, and produces an ML program instead of a ! 68: C program. ML-Lex produces a program that runs very efficiently. ! 69: Typically the program will be as fast or even faster than a ! 70: hand-coded lexer implemented in Standard ML. ! 71: ! 72: The program typically uses only a small amount of space. ! 73: ML-Lex thus allows ML programmers the same benefits that Lex allows C ! 74: programmers. It also does not place artificial limits on the size of ! 75: recognized strings. ! 76: ! 77: II. ML-Lex specifications ! 78: ! 79: An ML-Lex specification has the general format: ! 80: ! 81: {user declarations} ! 82: %% ! 83: {ML-Lex definitions} ! 84: %% ! 85: {rules} ! 86: ! 87: Each section is separated from the others by a '%%' delimiter. ! 88: ! 89: The rules are used to define the lexical analysis function. Each ! 90: rule has two parts - a regular expression and an action. The regular ! 91: expression defines the word class that a rule matches. The action is ! 92: a program fragment to be executed when a rule matches the input. The ! 93: actions are used to compute values, and must all return values of the ! 94: same type. ! 95: ! 96: The user can define values available to all rule actions in the user ! 97: declarations section. The user must define two values in this ! 98: section - a type lexresult and a function eof. Lexresult defines the ! 99: type of values returned by the rule actions. The function "eof" is ! 100: called by the lexer when the end of the input stream is reached. It ! 101: will typically return a value signalling eof or raise an exception. ! 102: It is called with the same argument as lex (see %arg, below), ! 103: and must return a value of type lexresult. ! 104: ! 105: In the definitions section, the user can define named regular ! 106: expressions, a set of start states, and specify which of the various ! 107: bells and whistles of ML-Lex are desired. ! 108: ! 109: The start states allow the user to control when certain rules are ! 110: matched. Rules may be defined to match only when the lexer is in ! 111: specific start states. The user may change the lexer's start state ! 112: in a rule action. This allows the user to specify special handling ! 113: of lexical objects. ! 114: ! 115: This feature is typically used to handle quoted strings with escapes ! 116: to denote special characters. The rules to recognize the inside ! 117: contents of a string are defined for only one start state. This ! 118: start state is entered when the beginning of a string is recognized, ! 119: and exited when the end of the string is recognized. ! 120: ! 121: III. Regular expressions. ! 122: ! 123: Regular expressions are a simple language for denoting classes of ! 124: strings. A regular expression is defined inductively over an ! 125: alphabet with a set of basic operations. The alphabet for ML-Lex is ! 126: the Ascii character set (character codes 0-127; or if %full is used, 0-255). ! 127: ! 128: The syntax and semantics of regular expressions will be described in ! 129: order of decreasing precedence (from the most tightly-binding operators ! 130: to the most weakly-binding): ! 131: ! 132: An individual character stands for itself, except for the ! 133: reserved characters ? * + | ( ) ^ $ / ; . = < > [ { " \ ! 134: ! 135: A backslash followed by one of the reserved characters stands ! 136: for that character. ! 137: ! 138: A set of characters enclosed in square brackets [ ] stands ! 139: for any one of those characters. Inside the brackets, only ! 140: the symbols \ - ^ are reserved. An initial up-arrow ^ stands ! 141: for the complement of the characters listed, e.g. [^abc] ! 142: stands any character except a, b, or c. The hyphen - denotes ! 143: a range of characters, e.g. [a-z] stands for any lower-case ! 144: alphabetic character, and [0-9a-fA-F] stands for any hexadecimal ! 145: digit. To include ^ literally in a bracketed set, put it anywhere ! 146: but first; to include - literally in a set, put it first or last. ! 147: ! 148: The dot . character stands for any character except newline, ! 149: i.e. the same as [^\n] ! 150: ! 151: The following special escape sequences are available, inside ! 152: or outside of square-brackets: ! 153: \b - backspace ! 154: \n - newline ! 155: \t - tab ! 156: \ddd - where ddd is a 3 digit decimal escape. ! 157: ! 158: A sequence of characters will stand for itself (reserved ! 159: characters will be taken literally) if it is enclosed in ! 160: double quotes " ". ! 161: ! 162: A named regular expression (defined in the "definitions" ! 163: section) may be referred to by enclosing its name in ! 164: braces { }. ! 165: ! 166: Any regular expression may be enclosed in parentheses ( ) ! 167: for syntactic (but, as usual, not semantic) effect. ! 168: ! 169: The postfix operator * stands for Kleene closure: ! 170: zero or more repetitions of the preceding expression. ! 171: ! 172: The postfix operator + stands for one or more repetitions ! 173: of the preceding expression. ! 174: ! 175: The postfix operator ? stands for zero or one occurrence of ! 176: the preceding expression. ! 177: ! 178: A postfix repetition range {n1,n2} where n1 and n2 are small ! 179: integers stands for any number of repetitions between n1 and n2 ! 180: of the preceding expression. The notation {n1} stands for ! 181: exactly n1 repetitions. ! 182: ! 183: Concatenation of expressions denotes concatenation of strings. ! 184: The expression e1 e2 stands for any string that results from ! 185: the concatenation of one string that matches e1 with another ! 186: string that matches e2. ! 187: ! 188: The infix operator | stands for alternation. The expression ! 189: e1 | e2 stands for anything that either e1 or e2 stands for. ! 190: ! 191: The infix operator / denotes lookahead. The expression e1 / e2 ! 192: matches any string that e1 stands for, but only when that ! 193: string is followed by a string that matches e2. ! 194: ! 195: When the up-arrow ^ occurs at the beginning of an expression, ! 196: that expression will only match strings that occur at the ! 197: beginning of a line (right after a newline character). ! 198: When the dollar sign $ occurs at the end of an expression, ! 199: that expression will only match strings that occur at the ! 200: end of a line (right before a newline character). ! 201: ! 202: Here are some examples of regular expressions, and descriptions of the ! 203: set of strings they denote: ! 204: ! 205: 0 | 1 | 2 | 3 A single digit between 0 and 3 ! 206: [0123] A single digit between 0 and 3 ! 207: 0123 The string "0123" ! 208: 0* All strings of 0 or more 0's ! 209: 00* All strings of 1 or more 0's ! 210: 0+ All strings of 1 or more 0's ! 211: [0-9]{3} Any three-digit decimal number. ! 212: \\[ntb] The strings "\n" "\t" "\b" ! 213: (00)* Any string with an even number of 0's. ! 214: ! 215: IV. ML-Lex syntax summary ! 216: ! 217: A. User declarations ! 218: ! 219: Anything up to the first %% is in the user declarations section. The ! 220: user should note that no symbolic identifier containing '%%' can be ! 221: used in this section. ! 222: ! 223: B. ML-Lex definitions ! 224: ! 225: Start states can be defined with ! 226: ! 227: %s {identifier list} ; ! 228: ! 229: or %S {identifier list} ; ! 230: ! 231: An identifier list consists of one or more identifiers. ! 232: ! 233: An identifier consists of one or more letters, digits, underscores, ! 234: or primes. It must begin with a letter. ! 235: ! 236: Named expressions can be defined with ! 237: ! 238: {identifier} = {regular expression} ; ! 239: ! 240: Regular expressions are defined below. ! 241: ! 242: The following % commands are also available: ! 243: ! 244: %reject - create REJECT() function ! 245: %count - count newlines using yylineno ! 246: %full - create lexer for the full 8-bit character set, ! 247: with characters in the range 0-255 permitted ! 248: as input. ! 249: %structure {identifier} - name the structure in the output program ! 250: {identifier} instead of Mlex ! 251: %header - use code following it to create header for lexer ! 252: structure ! 253: %arg - extra (curried) formal parameter argument to be ! 254: passed to the lex functions, and to be passed ! 255: to the eof function in place of () ! 256: These functions are discussed below, under values available to ! 257: actions. ! 258: ! 259: C. Rules ! 260: ! 261: Each rule has the format: ! 262: ! 263: <start state list> {regular expression} => ( ... code ... ); ! 264: ! 265: All parentheses in ... code ... must be balanced, including those ! 266: used in strings and comments. ! 267: ! 268: The start state list is optional. It consists of a list of ! 269: identifiers separated by commas, and is delimited by triangle ! 270: brackets < >. Each identifier must be a start state defined in the ! 271: %s section above. ! 272: ! 273: The regular expression is only recognized when the lexer is in one of ! 274: the start states in the start state list. If no start state list is ! 275: given, the expression is recognized in all start states. ! 276: ! 277: The lexer begins in a pre-defined start state called INITIAL. ! 278: ! 279: The lexer resolves conflicts among rules by choosing the rule with ! 280: the longest match, and in the case two rules match the same string, ! 281: choosing the rule listed first in the specification. ! 282: ! 283: The rules should match all possible input. If some input occurs that ! 284: does not match any rule, the lexer created by ML-Lex will raise an ! 285: exception LexError. Note that this differs from C Lex, which prints ! 286: any unmatched input on the standard output. ! 287: ! 288: V. Values available inside the code associated with a rule. ! 289: ! 290: Mlex places the value of the string matched by a regular expression ! 291: in yytext. Yytext has type string. The user may also recursively ! 292: call the lexing function with lex(). ! 293: ! 294: This is convenient for ignoring white space or comments silently: ! 295: ! 296: [\ \t\n]+ => ( lex()); ! 297: ! 298: To switch start states, the user may call YYBEGIN with the name of a ! 299: start state. ! 300: ! 301: The following values will be available only if the corresponding % ! 302: command is in the ML-Lex definitions sections: ! 303: ! 304: value %command description ! 305: ----- -------- ----------- ! 306: REJECT %reject REJECT() causes the current ! 307: rule to be "rejected." ! 308: The lexer behaves as if the ! 309: current rule had not matched; ! 310: another rule that matches this ! 311: string, or that matches the longest ! 312: possible prefix of this string, ! 313: is used instead. ! 314: ! 315: yylineno %count Current line number ! 316: ! 317: ! 318: These values should be used only if necessary. Adding REJECT to a ! 319: lexer will slow it down by 20%; adding yylineno will slow it down by ! 320: another 20%, or more. (It is much more efficient to recognize \n and ! 321: have an action that increments the line-number variable.) The use of ! 322: the lookahead operator / will also slow down the entire lexer. ! 323: ! 324: VI. Running ML-Lex ! 325: ! 326: Use "lexgen.sml"; this will create a structure LexGen. The function ! 327: LexGen.lexGen creates a program for a lexer from an input ! 328: specification. It takes a string argument -- the name of the file ! 329: containing the input specification. The output file name is ! 330: determined by appending ".sml" to the input file name. ! 331: ! 332: VII. Using the program produced by ML-Lex. ! 333: ! 334: When the output file is loaded, it will create a structure Mlex that ! 335: contains the function makeLexer. makeLexer takes a function from int ! 336: -> string and returns a lexing function. ! 337: ! 338: For example, ! 339: ! 340: val lexer = Mlex.makeLexer (input (open_in "f")) ! 341: ! 342: creates a lexer that operates on the file whose name is f. ! 343: ! 344: The function from int -> string should read a string of characters ! 345: from the input stream. It should return a null string to indicate ! 346: that the end of the stream has been reached. The integer is the ! 347: number of characters that the lexer wishes to read; the function may ! 348: return any non-zero number of characters. For example, ! 349: val lexer = Mlex.makeLexer (fn n => input_line std_in) ! 350: is appropriate for interactive streams where prompting, etc. occurs; ! 351: the lexer won't care that input_line might return a string of more ! 352: than or less than n characters. ! 353: ! 354: The lexer tries to read a large number of characters from the input ! 355: function at once, and it is desirable that the input function return ! 356: as many as possible. Reading many characters at once makes the lexer ! 357: more efficient. Fewer input calls and buffering operations are ! 358: needed, and input is more efficient in large block reads. For ! 359: interactive streams this is less of a concern, as the limiting factor ! 360: is the speed at which the user can type. ! 361: ! 362: To obtain a value, invoke the lexer by passing it a unit: ! 363: ! 364: val nextToken = lexer() ! 365: ! 366: If one wanted to restart the lexer, one would just discard "lexer" ! 367: and create a new lexer on the same stream with another call to ! 368: makeLexer. This is the best way to discard any characters buffered ! 369: internally by the lexer. ! 370: ! 371: All code in the user declarations section is placed inside a ! 372: structure UserDeclarations. To access this structure, use the path name ! 373: Mlex.UserDeclarations. ! 374: ! 375: If any input cannot be matched, the program will raise the exception ! 376: Mlex.LexError. An internal error (i.e. bug) will cause the ! 377: exception Internal.LexerError to be raised. ! 378: ! 379: If %structure is used, remember that the structure name will no ! 380: longer be Mlex, but the one specified in the command. ! 381: ! 382: VIII. Sample ! 383: ! 384: Here is a sample lexer for a calculator program: ! 385: ! 386: datatype lexresult= DIV | EOF | EOS | ID of string | LPAREN | ! 387: NUM of int | PLUS | PRINT | RPAREN | SUB | TIMES ! 388: ! 389: val linenum = ref 1 ! 390: val error = fn x => output std_out (x ^ "\n") ! 391: val eof = fn () => EOF ! 392: %% ! 393: %structure CalcLex ! 394: alpha=[A-Za-z]; ! 395: digit=[0-9]; ! 396: ws = [\ \t]; ! 397: %% ! 398: \n => (inc linenum; lex()); ! 399: {ws}+ => (lex()); ! 400: "/" => (DIV); ! 401: ";" => (EOS); ! 402: "(" => (LPAREN); ! 403: {digit}+ => (NUM (revfold (fn(a,r)=>ord(a)-ord("0")+10*r) (explode yytext) 0)); ! 404: ")" => (RPAREN); ! 405: "+" => (PLUS); ! 406: {alpha}+ => (if yytext="print" then PRINT else ID yytext); ! 407: "-" => (SUB); ! 408: "*" => (TIMES); ! 409: . => (error ("calc: ignoring bad character "^yytext); lex()); ! 410: ! 411: ! 412: Here is the parser for the calculator: ! 413: ! 414: (* Sample interactive calculator to demonstrate use of lexer produced by ML-Lex ! 415: ! 416: The original grammar was ! 417: ! 418: stmt_list -> stmt_list stmt ! 419: stmt -> print exp ; | exp ; ! 420: exp -> exp + t | exp - t | t ! 421: t -> t * f | t/f | f ! 422: f -> (exp) | id | num ! 423: ! 424: The function parse takes a stream and parses it for the calculator ! 425: program. ! 426: ! 427: If a syntax error occurs, parse prints an error message and calls itself ! 428: on the stream. On this system that has the effect of ignoring all input ! 429: to the end of a line. ! 430: *) ! 431: ! 432: structure Calc = ! 433: struct ! 434: open CalcLex ! 435: open UserDeclarations ! 436: exception Error ! 437: fun parse strm = ! 438: let ! 439: val say = output std_out ! 440: val lexer = makeLexer (fn n => input strm (min(1,max(can_input strm,n)))) ! 441: val nexttok = ref (lexer()) ! 442: val advance = fn () => (nexttok := lexer(); !nexttok) ! 443: val error = fn () => (say ("calc: syntax error on line" ^ ! 444: (makestring(!linenum)) ^ "\n"); raise Error) ! 445: val lookup = fn i => ! 446: if i = "ONE" then 1 ! 447: else if i = "TWO" then 2 ! 448: else (say ("calc: unknown identifier '" ^ i ^ "'\n"); raise Error) ! 449: fun STMT_LIST () = ! 450: case !nexttok of ! 451: EOF => () ! 452: | _ => (STMT(); STMT_LIST()) ! 453: ! 454: and STMT() = ! 455: (case !nexttok ! 456: of EOS => () ! 457: | PRINT => (advance(); say ((makestring (E():int)) ^ "\n"); ()) ! 458: | _ => (E(); ()); ! 459: case !nexttok ! 460: of EOS => (advance()) ! 461: | _ => error()) ! 462: and E () = E' (T()) ! 463: and E' (i : int ) = ! 464: case !nexttok of ! 465: PLUS => (advance (); E'(i+T())) ! 466: | SUB => (advance (); E'(i-T())) ! 467: | RPAREN => i ! 468: | EOF => i ! 469: | EOS => i ! 470: | _ => error() ! 471: and T () = T'(F()) ! 472: and T' i = ! 473: case !nexttok of ! 474: PLUS => i ! 475: | SUB => i ! 476: | TIMES => (advance(); T'(i*F())) ! 477: | DIV => (advance (); T'(i div F())) ! 478: | EOF => i ! 479: | EOS => i ! 480: | RPAREN => i ! 481: | _ => error() ! 482: and F () = ! 483: case !nexttok of ! 484: ID i => (advance(); lookup i) ! 485: | LPAREN => ! 486: let val v = (advance(); E()) ! 487: in if !nexttok = RPAREN then (advance (); v) else error() ! 488: end ! 489: | NUM i => (advance(); i) ! 490: | _ => error() ! 491: in STMT_LIST () handle Error => parse strm ! 492: end ! 493: end
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.