Annotation of researchv10dc/cmd/sml/lib/lexgen/lexgen.doc, revision 1.1.1.1

1.1       root        1:         A lexical analyzer generator for Standard ML.
                      2: 
                      3: 
                      4:                     Andrew W. Appel
                      5:                     James S. Mattson
                      6:                     David R. Tarditi
                      7: 
                      8:                     Princeton University
                      9: 
                     10:                     Version 1.2, October 1989
                     11: 
                     12: Copyright (c) 1989 by Andrew W. Appel, James S. Mattson, David R. Tarditi
                     13: 
                     14: This software comes with ABSOLUTELY NO WARRANTY.
                     15: This software is subject only to the PRINCETON STANDARD ML SOFTWARE LIBRARY
                     16: COPYRIGHT NOTICE, LICENSE AND DISCLAIMER, (in the file "COPYRIGHT",
                     17: distributed with this software). You may copy and distribute this software;
                     18: see the COPYRIGHT NOTICE for details and restrictions.
                     19: 
                     20: I. General Description
                     21: 
                     22: Computer programs often need to divide their input into words and
                     23: distinguish between different kinds of words.  Compilers, for
                     24: example, need to distinguish between integers, reserved words, and
                     25: identifiers.  Applications programs often need to be able to
                     26: recognize components of typed commands from users.
                     27: 
                     28: The problem of segmenting input into words and recognizing classes of
                     29: words is known as lexical analysis.  Small cases of this problem,
                     30: such as reading text strings separated by spaces, can be solved by
                     31: using hand-written programs.  Larger cases of this problem, such as
                     32: tokenizing an input stream for a compiler, can also be solved using
                     33: hand-written programs.
                     34: 
                     35: A hand-written program for a large lexical analysis problem, however,
                     36: suffers from two major problems.  First, the program requires a fair
                     37: amount of programmer time to create.  Second, the description of
                     38: classes of words is not explicit in the program.  It must be inferred
                     39: from the program code.  This makes it difficult to verify if the
                     40: program recognizes the correct words for each class.  It also makes
                     41: future maintenance of the program difficult.
                     42: 
                     43: Lex, a programming tool for the Unix system, is a successful solution
                     44: to the general problem of lexical analysis.  It uses regular
                     45: expressions to describe classes of words.  A program fragment is
                     46: associated with each class of words.  This information is given to
                     47: Lex as a specification (a Lex program).  Lex produces a program for a
                     48: function that can be used to perform lexical analysis.
                     49: 
                     50: The function operates as follows.  It finds the longest word starting
                     51: from the current position in the input stream that is in one of the
                     52: word classes.  It executes the program fragment associated with the
                     53: class, and sets the current position in the input stream to be the
                     54: character after the word.  The program fragment has the actual text
                     55: of the word available to it, and may be any piece of code.  For many
                     56: applications it returns some kind of value.
                     57: 
                     58: Lex allows the programmer to make the language description explicit,
                     59: and to concentrate on what to do with the recognized words, not how
                     60: to recognize the words.  It saves programmer time and increases
                     61: program maintainability.
                     62: 
                     63: Unfortunately, Lex is targeted only C.  It also places artificial 
                     64: limits on the size of strings that can be recognized.
                     65: 
                     66: ML-Lex is a variant of Lex for the ML programming language.  ML-Lex
                     67: has a syntax similar to Lex, and produces an ML program instead of a
                     68: C program.  ML-Lex produces a program that runs very efficiently.
                     69: Typically the program will be as fast or even faster than a
                     70: hand-coded lexer implemented in Standard ML.
                     71: 
                     72: The program typically uses only a small amount of space.
                     73: ML-Lex thus allows ML programmers the same benefits that Lex allows C
                     74: programmers.  It also does not place artificial limits on the size of
                     75: recognized strings.
                     76: 
                     77: II. ML-Lex specifications
                     78: 
                     79: An ML-Lex specification has the general format:
                     80: 
                     81:         {user declarations}
                     82:         %%
                     83:         {ML-Lex definitions}
                     84:         %%
                     85:         {rules}
                     86: 
                     87: Each section is separated from the others by a '%%' delimiter.
                     88: 
                     89: The rules are used to define the lexical analysis function.  Each
                     90: rule has two parts - a regular expression and an action.  The regular
                     91: expression defines the word class that a rule matches.  The action is
                     92: a program fragment to be executed when a rule matches the input.  The
                     93: actions are used to compute values, and must all return values of the
                     94: same type.
                     95: 
                     96: The user can define values available to all rule actions in the user
                     97: declarations section.  The user must define two values in this
                     98: section - a type lexresult and a function eof.  Lexresult defines the
                     99: type of values returned by the rule actions.  The function "eof" is
                    100: called by the lexer when the end of the input stream is reached.  It
                    101: will typically return a value signalling eof or raise an exception.
                    102: It is called with the same argument as lex (see %arg, below),
                    103: and must return a value of type lexresult.
                    104: 
                    105: In the definitions section, the user can define named regular
                    106: expressions, a set of start states, and specify which of the various
                    107: bells and whistles of ML-Lex are desired.
                    108: 
                    109: The start states allow the user to control when certain rules are
                    110: matched.  Rules may be defined to match only when the lexer is in
                    111: specific start states.  The user may change the lexer's start state
                    112: in a rule action.  This allows the user to specify special handling
                    113: of lexical objects.
                    114: 
                    115: This feature is typically used to handle quoted strings with escapes
                    116: to denote special characters.  The rules to recognize the inside
                    117: contents of a string are defined for only one start state.  This
                    118: start state is entered when the beginning of a string is recognized,
                    119: and exited when the end of the string is recognized.
                    120: 
                    121: III. Regular expressions.
                    122: 
                    123: Regular expressions are a simple language for denoting classes of
                    124: strings.  A regular expression is defined inductively over an
                    125: alphabet with a set of basic operations.  The alphabet for ML-Lex is
                    126: the Ascii character set (character codes 0-127; or if %full is used, 0-255).
                    127: 
                    128: The syntax and semantics of regular expressions will be described in
                    129: order of decreasing precedence (from the most tightly-binding operators
                    130: to the most weakly-binding):
                    131: 
                    132:        An individual character stands for itself, except for the
                    133:        reserved characters  ? * + | ( ) ^ $ / ; . = < > [ { " \
                    134: 
                    135:        A backslash followed by one of the reserved characters stands
                    136:        for that character.
                    137: 
                    138:        A set of characters enclosed in square brackets [ ] stands
                    139:        for any one of those characters.  Inside the brackets, only
                    140:        the symbols  \ - ^ are reserved.  An initial up-arrow ^ stands
                    141:        for the complement of the characters listed, e.g. [^abc]
                    142:        stands any character except a, b, or c.  The hyphen - denotes
                    143:        a range of characters, e.g. [a-z] stands for any lower-case
                    144:        alphabetic character, and [0-9a-fA-F] stands for any hexadecimal
                    145:        digit.  To include ^ literally in a bracketed set, put it anywhere
                    146:        but first; to include - literally in a set, put it first or last.
                    147: 
                    148:        The dot . character stands for any character except newline,
                    149:        i.e. the same as [^\n]  
                    150: 
                    151:        The following special escape sequences are available, inside
                    152:        or outside of square-brackets:
                    153:                \b      - backspace
                    154:                \n      - newline
                    155:                \t      - tab
                    156:                \ddd    - where ddd is a 3 digit decimal escape.
                    157: 
                    158:        A sequence of characters will stand for itself (reserved
                    159:         characters will be taken literally) if it is enclosed in
                    160:        double quotes " ".
                    161: 
                    162:        A named regular expression (defined in the "definitions"
                    163:        section) may be referred to by enclosing its name in
                    164:        braces { }.
                    165: 
                    166:        Any regular expression may be enclosed in parentheses ( )
                    167:        for syntactic (but, as usual, not semantic) effect.
                    168: 
                    169:        The postfix operator * stands for Kleene closure:
                    170:        zero or more repetitions of the preceding expression.
                    171: 
                    172:        The postfix operator + stands for one or more repetitions
                    173:        of the preceding expression.
                    174: 
                    175:        The postfix operator ? stands for zero or one occurrence of
                    176:        the preceding expression.
                    177: 
                    178:        A postfix repetition range {n1,n2} where n1 and n2 are small
                    179:        integers stands for any number of repetitions between n1 and n2
                    180:        of the preceding expression.  The notation {n1} stands for
                    181:        exactly n1 repetitions.
                    182: 
                    183:        Concatenation of expressions denotes concatenation of strings.
                    184:        The expression e1 e2 stands for any string that results from
                    185:        the concatenation of one string that matches e1 with another
                    186:        string that matches e2.
                    187: 
                    188:        The infix operator | stands for alternation.  The expression
                    189:        e1 | e2  stands for anything that either e1 or e2 stands for.
                    190:     
                    191:        The infix operator / denotes lookahead.  The expression e1 / e2
                    192:        matches any string that e1 stands for, but only when that
                    193:        string is followed by a string that matches e2.
                    194: 
                    195:        When the up-arrow ^ occurs at the beginning of an expression,
                    196:        that expression will only match strings that occur at the
                    197:        beginning of a line (right after a newline character).
                    198:        When the dollar sign $ occurs at the end of an expression,
                    199:        that expression will only match strings that occur at the
                    200:        end of a line (right before a newline character).
                    201:        
                    202: Here are some examples of regular expressions, and descriptions of the
                    203: set of strings they denote:
                    204: 
                    205:         0 | 1 | 2 | 3           A single digit between 0 and 3
                    206:        [0123]                  A single digit between 0 and 3
                    207:         0123                    The string "0123"
                    208:         0*                      All strings of 0 or more 0's
                    209:         00*                     All strings of 1  or more 0's
                    210:        0+                      All strings of 1  or more 0's
                    211:        [0-9]{3}                Any three-digit decimal number.
                    212:        \\[ntb]                 The strings "\n" "\t" "\b"
                    213:        (00)*                   Any string with an even number of 0's.
                    214: 
                    215: IV. ML-Lex syntax summary
                    216: 
                    217: A. User declarations
                    218: 
                    219: Anything up to the first %% is in the user declarations section.  The
                    220: user should note that no symbolic identifier containing '%%' can be
                    221: used in this section.
                    222: 
                    223: B. ML-Lex definitions
                    224: 
                    225: Start states can be defined with
                    226: 
                    227:                 %s {identifier list} ;
                    228: 
                    229:         or      %S {identifier list} ;
                    230: 
                    231: An identifier list consists of one or more identifiers.
                    232: 
                    233: An identifier consists of one or more letters, digits, underscores,
                    234: or primes.  It must begin with a letter.
                    235: 
                    236: Named expressions can be defined with
                    237: 
                    238:         {identifier} = {regular expression} ;
                    239: 
                    240: Regular expressions are defined below.
                    241: 
                    242: The following % commands are also available:
                    243: 
                    244:         %reject         - create REJECT() function
                    245:         %count          - count newlines using yylineno
                    246:         %full           - create lexer for the full 8-bit character set,
                    247:                           with characters in the range 0-255 permitted
                    248:                           as input.
                    249:         %structure {identifier} - name the structure in the output program
                    250:                           {identifier} instead of Mlex
                    251:        %header         - use code following it to create header for lexer
                    252:                          structure
                    253:         %arg            - extra (curried) formal parameter argument to be
                    254:                          passed to the lex functions, and to be passed
                    255:                          to the eof function in place of ()
                    256:         These functions are discussed below, under values available to
                    257:         actions.
                    258: 
                    259: C.  Rules
                    260: 
                    261: Each rule has the format:
                    262: 
                    263:         <start state list> {regular expression} => ( ... code ... );
                    264: 
                    265: All parentheses in ...  code ...  must be balanced, including those
                    266: used in strings and comments.
                    267: 
                    268: The start state list is optional.  It consists of a list of
                    269: identifiers separated by commas, and is delimited by triangle
                    270: brackets < >.  Each identifier must be a start state defined in the
                    271: %s section above.
                    272: 
                    273: The regular expression is only recognized when the lexer is in one of
                    274: the start states in the start state list.  If no start state list is
                    275: given, the expression is recognized in all start states.
                    276: 
                    277: The lexer begins in a pre-defined start state called INITIAL.
                    278: 
                    279: The lexer resolves conflicts among rules by choosing the rule with
                    280: the longest match, and in the case two rules match the same string,
                    281: choosing the rule listed first in the specification.
                    282: 
                    283: The rules should match all possible input.  If some input occurs that
                    284: does not match any rule, the lexer created by ML-Lex will raise an
                    285: exception LexError.  Note that this differs from C Lex, which prints
                    286: any unmatched input on the standard output.
                    287: 
                    288: V. Values available inside the code associated with a rule.
                    289: 
                    290: Mlex places the value of the string matched by a regular expression
                    291: in yytext.  Yytext has type string.  The user may also recursively
                    292: call the lexing function with lex().
                    293: 
                    294: This is convenient for ignoring white space or comments silently:
                    295: 
                    296:         [\ \t\n]+       => ( lex());
                    297: 
                    298: To switch start states, the user may call YYBEGIN with the name of a
                    299: start state.
                    300: 
                    301: The following values will be available only if the corresponding %
                    302: command is in the ML-Lex definitions sections:
                    303: 
                    304:         value           %command        description
                    305:         -----           --------        -----------
                    306:         REJECT          %reject         REJECT() causes the current
                    307:                                        rule to be "rejected."
                    308:                                        The lexer behaves as if the
                    309:                                        current rule had not matched;
                    310:                                        another rule that matches this
                    311:                                        string, or that matches the longest
                    312:                                        possible prefix of this string,
                    313:                                        is used instead.
                    314:                                 
                    315:         yylineno        %count          Current line number
                    316:         
                    317: 
                    318: These values should be used only if necessary.  Adding REJECT to a
                    319: lexer will slow it down by 20%; adding yylineno will slow it down by
                    320: another 20%, or more.  (It is much more efficient to recognize \n and
                    321: have an action that increments the line-number variable.)  The use of
                    322: the lookahead operator / will also slow down the entire lexer.
                    323: 
                    324: VI. Running ML-Lex
                    325: 
                    326: Use "lexgen.sml"; this will create a structure LexGen.  The function
                    327: LexGen.lexGen creates a program for a lexer from an input
                    328: specification.  It takes a string argument -- the name of the file
                    329: containing the input specification.  The output file name is
                    330: determined by appending ".sml" to the input file name.
                    331: 
                    332: VII. Using the program produced by ML-Lex.
                    333: 
                    334: When the output file is loaded, it will create a structure Mlex that
                    335: contains the function makeLexer.  makeLexer takes a function from int
                    336: -> string and returns a lexing function.
                    337: 
                    338: For example,
                    339: 
                    340:         val lexer = Mlex.makeLexer (input (open_in "f"))
                    341: 
                    342: creates a lexer that operates on the file whose name is f.
                    343: 
                    344: The function from int -> string should read a string of characters
                    345: from the input stream.  It should return a null string to indicate
                    346: that the end of the stream has been reached.  The integer is the
                    347: number of characters that the lexer wishes to read; the function may
                    348: return any non-zero number of characters.  For example, 
                    349:         val lexer = Mlex.makeLexer (fn n => input_line std_in)
                    350: is appropriate for interactive streams where prompting, etc.  occurs;
                    351: the lexer won't care that input_line might return a string of more
                    352: than or less than n characters.
                    353: 
                    354: The lexer tries to read a large number of characters from the input
                    355: function at once, and it is desirable that the input function return
                    356: as many as possible.  Reading many characters at once makes the lexer
                    357: more efficient.  Fewer input calls and buffering operations are
                    358: needed, and input is more efficient in large block reads.  For
                    359: interactive streams this is less of a concern, as the limiting factor
                    360: is the speed at which the user can type.
                    361: 
                    362: To obtain a value, invoke the lexer by passing it a unit:
                    363: 
                    364:         val nextToken = lexer()
                    365: 
                    366: If one wanted to restart the lexer, one would just discard "lexer"
                    367: and create a new lexer on the same stream with another call to
                    368: makeLexer.  This is the best way to discard any characters buffered
                    369: internally by the lexer.
                    370: 
                    371: All code in the user declarations section is placed inside a
                    372: structure UserDeclarations.  To access this structure, use the path name
                    373: Mlex.UserDeclarations.
                    374: 
                    375: If any input cannot be matched, the program will raise the exception
                    376: Mlex.LexError.  An internal error (i.e.  bug) will cause the
                    377: exception Internal.LexerError to be raised.
                    378: 
                    379: If %structure is used, remember that the structure name will no
                    380: longer be Mlex, but the one specified in the command.
                    381: 
                    382: VIII. Sample
                    383: 
                    384: Here is a sample lexer for a calculator program:
                    385: 
                    386: datatype lexresult= DIV | EOF | EOS | ID of string | LPAREN |
                    387:                      NUM of int | PLUS | PRINT | RPAREN | SUB | TIMES 
                    388: 
                    389: val linenum = ref 1
                    390: val error = fn x => output std_out (x ^ "\n")
                    391: val eof = fn () => EOF
                    392: %%
                    393: %structure CalcLex
                    394: alpha=[A-Za-z];
                    395: digit=[0-9];
                    396: ws = [\ \t];
                    397: %%
                    398: \n       => (inc linenum; lex());
                    399: {ws}+    => (lex());
                    400: "/"      => (DIV);
                    401: ";"      => (EOS);
                    402: "("      => (LPAREN);
                    403: {digit}+ => (NUM (revfold (fn(a,r)=>ord(a)-ord("0")+10*r) (explode yytext) 0));
                    404: ")"      => (RPAREN);
                    405: "+"      => (PLUS);
                    406: {alpha}+ => (if yytext="print" then PRINT else ID yytext);
                    407: "-"      => (SUB);
                    408: "*"      => (TIMES);
                    409: .        => (error ("calc: ignoring bad character "^yytext); lex());
                    410: 
                    411: 
                    412: Here is the parser for the calculator:
                    413: 
                    414: (* Sample interactive calculator to demonstrate use of lexer produced by ML-Lex
                    415:  
                    416:    The original grammar was
                    417: 
                    418:        stmt_list -> stmt_list stmt
                    419:        stmt -> print exp ;  | exp ;
                    420:        exp -> exp + t | exp - t | t
                    421:        t -> t * f | t/f | f
                    422:        f -> (exp) | id | num
                    423: 
                    424:   The function parse takes a stream and parses it for the calculator 
                    425:   program.
                    426: 
                    427:   If a syntax error occurs, parse prints an error message and calls itself
                    428:   on the stream.  On this system that has the effect of ignoring all input
                    429:   to the end of a line.
                    430: *)
                    431:        
                    432: structure Calc =
                    433:  struct
                    434:    open CalcLex
                    435:    open UserDeclarations
                    436:    exception Error
                    437:    fun parse strm =
                    438:     let
                    439:       val say = output std_out
                    440:       val lexer = makeLexer (fn n => input strm (min(1,max(can_input strm,n))))
                    441:       val nexttok = ref (lexer())
                    442:       val advance = fn () => (nexttok := lexer(); !nexttok)
                    443:       val error = fn () => (say ("calc: syntax error on line" ^
                    444:                            (makestring(!linenum)) ^ "\n"); raise Error)
                    445:       val lookup = fn i =>
                    446:         if i = "ONE" then 1
                    447:         else if i = "TWO" then 2
                    448:         else  (say ("calc: unknown identifier '" ^ i ^ "'\n"); raise Error)
                    449:      fun STMT_LIST () =
                    450:          case !nexttok of
                    451:             EOF => ()
                    452:           | _ => (STMT(); STMT_LIST())
                    453:         
                    454:      and STMT() =
                    455:          (case !nexttok
                    456:            of EOS  => ()
                    457:             | PRINT => (advance(); say ((makestring (E():int)) ^ "\n"); ())
                    458:             | _ => (E(); ());
                    459:          case !nexttok
                    460:            of EOS => (advance())
                    461:             | _ => error())
                    462:      and E () = E' (T())
                    463:      and E' (i : int ) =
                    464:          case !nexttok of
                    465:             PLUS => (advance (); E'(i+T()))
                    466:           | SUB => (advance (); E'(i-T()))
                    467:           | RPAREN => i
                    468:           | EOF => i
                    469:           | EOS => i
                    470:           | _ => error()
                    471:      and T () =  T'(F())
                    472:      and T' i =
                    473:         case !nexttok of
                    474:             PLUS => i
                    475:           | SUB => i
                    476:           | TIMES => (advance(); T'(i*F()))
                    477:           | DIV => (advance (); T'(i div F()))
                    478:           | EOF => i
                    479:           | EOS => i
                    480:           | RPAREN => i
                    481:           | _ => error()
                    482:      and F () =
                    483:         case !nexttok of
                    484:             ID i => (advance(); lookup i)
                    485:           | LPAREN =>
                    486:               let val v = (advance(); E())
                    487:               in if !nexttok = RPAREN then (advance (); v) else error()
                    488:               end
                    489:           | NUM i => (advance(); i)
                    490:           | _ => error()
                    491:     in STMT_LIST () handle Error => parse strm
                    492:     end
                    493:  end

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.