Annotation of researchv10no/cmd/sml/lib/lexgen/lexgen.doc, revision 1.1

1.1     ! root        1:         A lexical analyzer generator for Standard ML.
        !             2: 
        !             3: 
        !             4:                     Andrew W. Appel
        !             5:                     James S. Mattson
        !             6:                     David R. Tarditi
        !             7: 
        !             8:                     Princeton University
        !             9: 
        !            10:                     Version 1.2, October 1989
        !            11: 
        !            12: Copyright (c) 1989 by Andrew W. Appel, James S. Mattson, David R. Tarditi
        !            13: 
        !            14: This software comes with ABSOLUTELY NO WARRANTY.
        !            15: This software is subject only to the PRINCETON STANDARD ML SOFTWARE LIBRARY
        !            16: COPYRIGHT NOTICE, LICENSE AND DISCLAIMER, (in the file "COPYRIGHT",
        !            17: distributed with this software). You may copy and distribute this software;
        !            18: see the COPYRIGHT NOTICE for details and restrictions.
        !            19: 
        !            20: I. General Description
        !            21: 
        !            22: Computer programs often need to divide their input into words and
        !            23: distinguish between different kinds of words.  Compilers, for
        !            24: example, need to distinguish between integers, reserved words, and
        !            25: identifiers.  Applications programs often need to be able to
        !            26: recognize components of typed commands from users.
        !            27: 
        !            28: The problem of segmenting input into words and recognizing classes of
        !            29: words is known as lexical analysis.  Small cases of this problem,
        !            30: such as reading text strings separated by spaces, can be solved by
        !            31: using hand-written programs.  Larger cases of this problem, such as
        !            32: tokenizing an input stream for a compiler, can also be solved using
        !            33: hand-written programs.
        !            34: 
        !            35: A hand-written program for a large lexical analysis problem, however,
        !            36: suffers from two major problems.  First, the program requires a fair
        !            37: amount of programmer time to create.  Second, the description of
        !            38: classes of words is not explicit in the program.  It must be inferred
        !            39: from the program code.  This makes it difficult to verify if the
        !            40: program recognizes the correct words for each class.  It also makes
        !            41: future maintenance of the program difficult.
        !            42: 
        !            43: Lex, a programming tool for the Unix system, is a successful solution
        !            44: to the general problem of lexical analysis.  It uses regular
        !            45: expressions to describe classes of words.  A program fragment is
        !            46: associated with each class of words.  This information is given to
        !            47: Lex as a specification (a Lex program).  Lex produces a program for a
        !            48: function that can be used to perform lexical analysis.
        !            49: 
        !            50: The function operates as follows.  It finds the longest word starting
        !            51: from the current position in the input stream that is in one of the
        !            52: word classes.  It executes the program fragment associated with the
        !            53: class, and sets the current position in the input stream to be the
        !            54: character after the word.  The program fragment has the actual text
        !            55: of the word available to it, and may be any piece of code.  For many
        !            56: applications it returns some kind of value.
        !            57: 
        !            58: Lex allows the programmer to make the language description explicit,
        !            59: and to concentrate on what to do with the recognized words, not how
        !            60: to recognize the words.  It saves programmer time and increases
        !            61: program maintainability.
        !            62: 
        !            63: Unfortunately, Lex is targeted only C.  It also places artificial 
        !            64: limits on the size of strings that can be recognized.
        !            65: 
        !            66: ML-Lex is a variant of Lex for the ML programming language.  ML-Lex
        !            67: has a syntax similar to Lex, and produces an ML program instead of a
        !            68: C program.  ML-Lex produces a program that runs very efficiently.
        !            69: Typically the program will be as fast or even faster than a
        !            70: hand-coded lexer implemented in Standard ML.
        !            71: 
        !            72: The program typically uses only a small amount of space.
        !            73: ML-Lex thus allows ML programmers the same benefits that Lex allows C
        !            74: programmers.  It also does not place artificial limits on the size of
        !            75: recognized strings.
        !            76: 
        !            77: II. ML-Lex specifications
        !            78: 
        !            79: An ML-Lex specification has the general format:
        !            80: 
        !            81:         {user declarations}
        !            82:         %%
        !            83:         {ML-Lex definitions}
        !            84:         %%
        !            85:         {rules}
        !            86: 
        !            87: Each section is separated from the others by a '%%' delimiter.
        !            88: 
        !            89: The rules are used to define the lexical analysis function.  Each
        !            90: rule has two parts - a regular expression and an action.  The regular
        !            91: expression defines the word class that a rule matches.  The action is
        !            92: a program fragment to be executed when a rule matches the input.  The
        !            93: actions are used to compute values, and must all return values of the
        !            94: same type.
        !            95: 
        !            96: The user can define values available to all rule actions in the user
        !            97: declarations section.  The user must define two values in this
        !            98: section - a type lexresult and a function eof.  Lexresult defines the
        !            99: type of values returned by the rule actions.  The function "eof" is
        !           100: called by the lexer when the end of the input stream is reached.  It
        !           101: will typically return a value signalling eof or raise an exception.
        !           102: It is called with the same argument as lex (see %arg, below),
        !           103: and must return a value of type lexresult.
        !           104: 
        !           105: In the definitions section, the user can define named regular
        !           106: expressions, a set of start states, and specify which of the various
        !           107: bells and whistles of ML-Lex are desired.
        !           108: 
        !           109: The start states allow the user to control when certain rules are
        !           110: matched.  Rules may be defined to match only when the lexer is in
        !           111: specific start states.  The user may change the lexer's start state
        !           112: in a rule action.  This allows the user to specify special handling
        !           113: of lexical objects.
        !           114: 
        !           115: This feature is typically used to handle quoted strings with escapes
        !           116: to denote special characters.  The rules to recognize the inside
        !           117: contents of a string are defined for only one start state.  This
        !           118: start state is entered when the beginning of a string is recognized,
        !           119: and exited when the end of the string is recognized.
        !           120: 
        !           121: III. Regular expressions.
        !           122: 
        !           123: Regular expressions are a simple language for denoting classes of
        !           124: strings.  A regular expression is defined inductively over an
        !           125: alphabet with a set of basic operations.  The alphabet for ML-Lex is
        !           126: the Ascii character set (character codes 0-127; or if %full is used, 0-255).
        !           127: 
        !           128: The syntax and semantics of regular expressions will be described in
        !           129: order of decreasing precedence (from the most tightly-binding operators
        !           130: to the most weakly-binding):
        !           131: 
        !           132:        An individual character stands for itself, except for the
        !           133:        reserved characters  ? * + | ( ) ^ $ / ; . = < > [ { " \
        !           134: 
        !           135:        A backslash followed by one of the reserved characters stands
        !           136:        for that character.
        !           137: 
        !           138:        A set of characters enclosed in square brackets [ ] stands
        !           139:        for any one of those characters.  Inside the brackets, only
        !           140:        the symbols  \ - ^ are reserved.  An initial up-arrow ^ stands
        !           141:        for the complement of the characters listed, e.g. [^abc]
        !           142:        stands any character except a, b, or c.  The hyphen - denotes
        !           143:        a range of characters, e.g. [a-z] stands for any lower-case
        !           144:        alphabetic character, and [0-9a-fA-F] stands for any hexadecimal
        !           145:        digit.  To include ^ literally in a bracketed set, put it anywhere
        !           146:        but first; to include - literally in a set, put it first or last.
        !           147: 
        !           148:        The dot . character stands for any character except newline,
        !           149:        i.e. the same as [^\n]  
        !           150: 
        !           151:        The following special escape sequences are available, inside
        !           152:        or outside of square-brackets:
        !           153:                \b      - backspace
        !           154:                \n      - newline
        !           155:                \t      - tab
        !           156:                \ddd    - where ddd is a 3 digit decimal escape.
        !           157: 
        !           158:        A sequence of characters will stand for itself (reserved
        !           159:         characters will be taken literally) if it is enclosed in
        !           160:        double quotes " ".
        !           161: 
        !           162:        A named regular expression (defined in the "definitions"
        !           163:        section) may be referred to by enclosing its name in
        !           164:        braces { }.
        !           165: 
        !           166:        Any regular expression may be enclosed in parentheses ( )
        !           167:        for syntactic (but, as usual, not semantic) effect.
        !           168: 
        !           169:        The postfix operator * stands for Kleene closure:
        !           170:        zero or more repetitions of the preceding expression.
        !           171: 
        !           172:        The postfix operator + stands for one or more repetitions
        !           173:        of the preceding expression.
        !           174: 
        !           175:        The postfix operator ? stands for zero or one occurrence of
        !           176:        the preceding expression.
        !           177: 
        !           178:        A postfix repetition range {n1,n2} where n1 and n2 are small
        !           179:        integers stands for any number of repetitions between n1 and n2
        !           180:        of the preceding expression.  The notation {n1} stands for
        !           181:        exactly n1 repetitions.
        !           182: 
        !           183:        Concatenation of expressions denotes concatenation of strings.
        !           184:        The expression e1 e2 stands for any string that results from
        !           185:        the concatenation of one string that matches e1 with another
        !           186:        string that matches e2.
        !           187: 
        !           188:        The infix operator | stands for alternation.  The expression
        !           189:        e1 | e2  stands for anything that either e1 or e2 stands for.
        !           190:     
        !           191:        The infix operator / denotes lookahead.  The expression e1 / e2
        !           192:        matches any string that e1 stands for, but only when that
        !           193:        string is followed by a string that matches e2.
        !           194: 
        !           195:        When the up-arrow ^ occurs at the beginning of an expression,
        !           196:        that expression will only match strings that occur at the
        !           197:        beginning of a line (right after a newline character).
        !           198:        When the dollar sign $ occurs at the end of an expression,
        !           199:        that expression will only match strings that occur at the
        !           200:        end of a line (right before a newline character).
        !           201:        
        !           202: Here are some examples of regular expressions, and descriptions of the
        !           203: set of strings they denote:
        !           204: 
        !           205:         0 | 1 | 2 | 3           A single digit between 0 and 3
        !           206:        [0123]                  A single digit between 0 and 3
        !           207:         0123                    The string "0123"
        !           208:         0*                      All strings of 0 or more 0's
        !           209:         00*                     All strings of 1  or more 0's
        !           210:        0+                      All strings of 1  or more 0's
        !           211:        [0-9]{3}                Any three-digit decimal number.
        !           212:        \\[ntb]                 The strings "\n" "\t" "\b"
        !           213:        (00)*                   Any string with an even number of 0's.
        !           214: 
        !           215: IV. ML-Lex syntax summary
        !           216: 
        !           217: A. User declarations
        !           218: 
        !           219: Anything up to the first %% is in the user declarations section.  The
        !           220: user should note that no symbolic identifier containing '%%' can be
        !           221: used in this section.
        !           222: 
        !           223: B. ML-Lex definitions
        !           224: 
        !           225: Start states can be defined with
        !           226: 
        !           227:                 %s {identifier list} ;
        !           228: 
        !           229:         or      %S {identifier list} ;
        !           230: 
        !           231: An identifier list consists of one or more identifiers.
        !           232: 
        !           233: An identifier consists of one or more letters, digits, underscores,
        !           234: or primes.  It must begin with a letter.
        !           235: 
        !           236: Named expressions can be defined with
        !           237: 
        !           238:         {identifier} = {regular expression} ;
        !           239: 
        !           240: Regular expressions are defined below.
        !           241: 
        !           242: The following % commands are also available:
        !           243: 
        !           244:         %reject         - create REJECT() function
        !           245:         %count          - count newlines using yylineno
        !           246:         %full           - create lexer for the full 8-bit character set,
        !           247:                           with characters in the range 0-255 permitted
        !           248:                           as input.
        !           249:         %structure {identifier} - name the structure in the output program
        !           250:                           {identifier} instead of Mlex
        !           251:        %header         - use code following it to create header for lexer
        !           252:                          structure
        !           253:         %arg            - extra (curried) formal parameter argument to be
        !           254:                          passed to the lex functions, and to be passed
        !           255:                          to the eof function in place of ()
        !           256:         These functions are discussed below, under values available to
        !           257:         actions.
        !           258: 
        !           259: C.  Rules
        !           260: 
        !           261: Each rule has the format:
        !           262: 
        !           263:         <start state list> {regular expression} => ( ... code ... );
        !           264: 
        !           265: All parentheses in ...  code ...  must be balanced, including those
        !           266: used in strings and comments.
        !           267: 
        !           268: The start state list is optional.  It consists of a list of
        !           269: identifiers separated by commas, and is delimited by triangle
        !           270: brackets < >.  Each identifier must be a start state defined in the
        !           271: %s section above.
        !           272: 
        !           273: The regular expression is only recognized when the lexer is in one of
        !           274: the start states in the start state list.  If no start state list is
        !           275: given, the expression is recognized in all start states.
        !           276: 
        !           277: The lexer begins in a pre-defined start state called INITIAL.
        !           278: 
        !           279: The lexer resolves conflicts among rules by choosing the rule with
        !           280: the longest match, and in the case two rules match the same string,
        !           281: choosing the rule listed first in the specification.
        !           282: 
        !           283: The rules should match all possible input.  If some input occurs that
        !           284: does not match any rule, the lexer created by ML-Lex will raise an
        !           285: exception LexError.  Note that this differs from C Lex, which prints
        !           286: any unmatched input on the standard output.
        !           287: 
        !           288: V. Values available inside the code associated with a rule.
        !           289: 
        !           290: Mlex places the value of the string matched by a regular expression
        !           291: in yytext.  Yytext has type string.  The user may also recursively
        !           292: call the lexing function with lex().
        !           293: 
        !           294: This is convenient for ignoring white space or comments silently:
        !           295: 
        !           296:         [\ \t\n]+       => ( lex());
        !           297: 
        !           298: To switch start states, the user may call YYBEGIN with the name of a
        !           299: start state.
        !           300: 
        !           301: The following values will be available only if the corresponding %
        !           302: command is in the ML-Lex definitions sections:
        !           303: 
        !           304:         value           %command        description
        !           305:         -----           --------        -----------
        !           306:         REJECT          %reject         REJECT() causes the current
        !           307:                                        rule to be "rejected."
        !           308:                                        The lexer behaves as if the
        !           309:                                        current rule had not matched;
        !           310:                                        another rule that matches this
        !           311:                                        string, or that matches the longest
        !           312:                                        possible prefix of this string,
        !           313:                                        is used instead.
        !           314:                                 
        !           315:         yylineno        %count          Current line number
        !           316:         
        !           317: 
        !           318: These values should be used only if necessary.  Adding REJECT to a
        !           319: lexer will slow it down by 20%; adding yylineno will slow it down by
        !           320: another 20%, or more.  (It is much more efficient to recognize \n and
        !           321: have an action that increments the line-number variable.)  The use of
        !           322: the lookahead operator / will also slow down the entire lexer.
        !           323: 
        !           324: VI. Running ML-Lex
        !           325: 
        !           326: Use "lexgen.sml"; this will create a structure LexGen.  The function
        !           327: LexGen.lexGen creates a program for a lexer from an input
        !           328: specification.  It takes a string argument -- the name of the file
        !           329: containing the input specification.  The output file name is
        !           330: determined by appending ".sml" to the input file name.
        !           331: 
        !           332: VII. Using the program produced by ML-Lex.
        !           333: 
        !           334: When the output file is loaded, it will create a structure Mlex that
        !           335: contains the function makeLexer.  makeLexer takes a function from int
        !           336: -> string and returns a lexing function.
        !           337: 
        !           338: For example,
        !           339: 
        !           340:         val lexer = Mlex.makeLexer (input (open_in "f"))
        !           341: 
        !           342: creates a lexer that operates on the file whose name is f.
        !           343: 
        !           344: The function from int -> string should read a string of characters
        !           345: from the input stream.  It should return a null string to indicate
        !           346: that the end of the stream has been reached.  The integer is the
        !           347: number of characters that the lexer wishes to read; the function may
        !           348: return any non-zero number of characters.  For example, 
        !           349:         val lexer = Mlex.makeLexer (fn n => input_line std_in)
        !           350: is appropriate for interactive streams where prompting, etc.  occurs;
        !           351: the lexer won't care that input_line might return a string of more
        !           352: than or less than n characters.
        !           353: 
        !           354: The lexer tries to read a large number of characters from the input
        !           355: function at once, and it is desirable that the input function return
        !           356: as many as possible.  Reading many characters at once makes the lexer
        !           357: more efficient.  Fewer input calls and buffering operations are
        !           358: needed, and input is more efficient in large block reads.  For
        !           359: interactive streams this is less of a concern, as the limiting factor
        !           360: is the speed at which the user can type.
        !           361: 
        !           362: To obtain a value, invoke the lexer by passing it a unit:
        !           363: 
        !           364:         val nextToken = lexer()
        !           365: 
        !           366: If one wanted to restart the lexer, one would just discard "lexer"
        !           367: and create a new lexer on the same stream with another call to
        !           368: makeLexer.  This is the best way to discard any characters buffered
        !           369: internally by the lexer.
        !           370: 
        !           371: All code in the user declarations section is placed inside a
        !           372: structure UserDeclarations.  To access this structure, use the path name
        !           373: Mlex.UserDeclarations.
        !           374: 
        !           375: If any input cannot be matched, the program will raise the exception
        !           376: Mlex.LexError.  An internal error (i.e.  bug) will cause the
        !           377: exception Internal.LexerError to be raised.
        !           378: 
        !           379: If %structure is used, remember that the structure name will no
        !           380: longer be Mlex, but the one specified in the command.
        !           381: 
        !           382: VIII. Sample
        !           383: 
        !           384: Here is a sample lexer for a calculator program:
        !           385: 
        !           386: datatype lexresult= DIV | EOF | EOS | ID of string | LPAREN |
        !           387:                      NUM of int | PLUS | PRINT | RPAREN | SUB | TIMES 
        !           388: 
        !           389: val linenum = ref 1
        !           390: val error = fn x => output std_out (x ^ "\n")
        !           391: val eof = fn () => EOF
        !           392: %%
        !           393: %structure CalcLex
        !           394: alpha=[A-Za-z];
        !           395: digit=[0-9];
        !           396: ws = [\ \t];
        !           397: %%
        !           398: \n       => (inc linenum; lex());
        !           399: {ws}+    => (lex());
        !           400: "/"      => (DIV);
        !           401: ";"      => (EOS);
        !           402: "("      => (LPAREN);
        !           403: {digit}+ => (NUM (revfold (fn(a,r)=>ord(a)-ord("0")+10*r) (explode yytext) 0));
        !           404: ")"      => (RPAREN);
        !           405: "+"      => (PLUS);
        !           406: {alpha}+ => (if yytext="print" then PRINT else ID yytext);
        !           407: "-"      => (SUB);
        !           408: "*"      => (TIMES);
        !           409: .        => (error ("calc: ignoring bad character "^yytext); lex());
        !           410: 
        !           411: 
        !           412: Here is the parser for the calculator:
        !           413: 
        !           414: (* Sample interactive calculator to demonstrate use of lexer produced by ML-Lex
        !           415:  
        !           416:    The original grammar was
        !           417: 
        !           418:        stmt_list -> stmt_list stmt
        !           419:        stmt -> print exp ;  | exp ;
        !           420:        exp -> exp + t | exp - t | t
        !           421:        t -> t * f | t/f | f
        !           422:        f -> (exp) | id | num
        !           423: 
        !           424:   The function parse takes a stream and parses it for the calculator 
        !           425:   program.
        !           426: 
        !           427:   If a syntax error occurs, parse prints an error message and calls itself
        !           428:   on the stream.  On this system that has the effect of ignoring all input
        !           429:   to the end of a line.
        !           430: *)
        !           431:        
        !           432: structure Calc =
        !           433:  struct
        !           434:    open CalcLex
        !           435:    open UserDeclarations
        !           436:    exception Error
        !           437:    fun parse strm =
        !           438:     let
        !           439:       val say = output std_out
        !           440:       val lexer = makeLexer (fn n => input strm (min(1,max(can_input strm,n))))
        !           441:       val nexttok = ref (lexer())
        !           442:       val advance = fn () => (nexttok := lexer(); !nexttok)
        !           443:       val error = fn () => (say ("calc: syntax error on line" ^
        !           444:                            (makestring(!linenum)) ^ "\n"); raise Error)
        !           445:       val lookup = fn i =>
        !           446:         if i = "ONE" then 1
        !           447:         else if i = "TWO" then 2
        !           448:         else  (say ("calc: unknown identifier '" ^ i ^ "'\n"); raise Error)
        !           449:      fun STMT_LIST () =
        !           450:          case !nexttok of
        !           451:             EOF => ()
        !           452:           | _ => (STMT(); STMT_LIST())
        !           453:         
        !           454:      and STMT() =
        !           455:          (case !nexttok
        !           456:            of EOS  => ()
        !           457:             | PRINT => (advance(); say ((makestring (E():int)) ^ "\n"); ())
        !           458:             | _ => (E(); ());
        !           459:          case !nexttok
        !           460:            of EOS => (advance())
        !           461:             | _ => error())
        !           462:      and E () = E' (T())
        !           463:      and E' (i : int ) =
        !           464:          case !nexttok of
        !           465:             PLUS => (advance (); E'(i+T()))
        !           466:           | SUB => (advance (); E'(i-T()))
        !           467:           | RPAREN => i
        !           468:           | EOF => i
        !           469:           | EOS => i
        !           470:           | _ => error()
        !           471:      and T () =  T'(F())
        !           472:      and T' i =
        !           473:         case !nexttok of
        !           474:             PLUS => i
        !           475:           | SUB => i
        !           476:           | TIMES => (advance(); T'(i*F()))
        !           477:           | DIV => (advance (); T'(i div F()))
        !           478:           | EOF => i
        !           479:           | EOS => i
        !           480:           | RPAREN => i
        !           481:           | _ => error()
        !           482:      and F () =
        !           483:         case !nexttok of
        !           484:             ID i => (advance(); lookup i)
        !           485:           | LPAREN =>
        !           486:               let val v = (advance(); E())
        !           487:               in if !nexttok = RPAREN then (advance (); v) else error()
        !           488:               end
        !           489:           | NUM i => (advance(); i)
        !           490:           | _ => error()
        !           491:     in STMT_LIST () handle Error => parse strm
        !           492:     end
        !           493:  end

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.