Annotation of researchv10no/cmd/sml/lib/mlyacc/mlyacc.doc, revision 1.1.1.1

                      1: 
                      2:                         ML-YACC, version 1.0
                      3: 
                      4:                         Preliminary Documentation
                      5:                         for Preliminary Version
                      6: 
                      7:                         David R. Tarditi
                      8:                         Andrew W. Appel
                      9: 
                     10:                         Department of Computer Science
                     11:                         Princeton University
                     12:                         Princeton, NJ 08544
                     13: 
                     14:                         February 1, 1989
                     15: 
                     16: (c) 1989 Andrew W. Appel, David R. Tarditi
                     17: This software comes with ABSOLUTELY NO WARRANTY.
                     18: This software is subject only to the PRINCETON STANDARD ML SOFTWARE LIBRARY
                     19: COPYRIGHT NOTICE, LICENSE AND DISCLAIMER, (in the file "COPYRIGHT",
                     20: distributed with this software). You may copy and distribute this software;
                     21: see the COPYRIGHT NOTICE for details and restrictions.
                     22: 
                     23: Description
                     24: -----------
                     25: 
                     26: This is a preliminary guide to using ML-Yacc. It is not complete documentation.
                     27: It explains how to invoke ML-Yacc and the syntax of an ML-Yacc specification.
                     28: The syntax description assumes the user knows how to use Yacc, and notes the
                     29: differences between ML-Yacc and Yacc.
                     30: 
                     31: Note that version 2.0 will be released at the end of 1989; it has a slightly
                     32: different interface and much improved error recovery.
                     33: 
                     34: ML-Yacc is a Yacc-like parser generator for Standard ML.  It generates
                     35: parsers for LALR languages, like Yacc, and its syntax is very similar to
                     36: that of Yacc.
                     37: 
                     38: It handles syntax errors differently from Yacc.  The parser
                     39: generated by ML-Yacc will attempt to automatically recover from a
                     40: syntax error by making a single token insertion, deletion, or substitution.
                     41: At the moment, only tokens without values can be inserted or substituted.
                     42: A future version will also insert tokens with values, by providing a
                     43: mechanism for specifying code to be evaluated for each token that will
                     44: provide a dummy value.
                     45: 
                     46: Syntax Error Recovery Method.
                     47: -----------------------------
                     48: 
                     49: The method used is described in 
                     50: 
                     51:         'A Practical Method for LR and LL Syntactic Error Diagnosis and
                     52:          Recovery', by M. Burke and G. Fisher, ACM Transactions on
                     53:          Programming Languages and Systems, Vol. 9, No. 2, April 1987,
                     54:          pp. 164-197.
                     55: 
                     56: The partial, deferred method discussed in the article has been
                     57: implemented.  
                     58: 
                     59: This method defers reductions for some number of shifted tokens.  The
                     60: deferred reductions are kept in a queue, and when an element is pulled
                     61: from the queue, the reductions are then applied to the value stack.
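
As a rough illustration of the deferral described above, the following sketch
models the queue in plain Standard ML.  It is a toy, not ML-Yacc's actual
implementation: the names push and flush, the int list value stack, and the
representation of a reduction as a function on that stack are all assumptions
made for the example.

```sml
(* Toy model: the value stack is an int list and a deferred reduction is a
   function from stacks to stacks.  All names here are hypothetical. *)
type stack = int list
type action = stack -> stack

(* Enqueue act; once more than k actions are pending, apply the oldest. *)
fun push (k : int) (queue : action list, stack : stack) (act : action)
    : action list * stack =
  case queue @ [act] of
      [] => ([], stack)                       (* unreachable: list is nonempty *)
    | oldest :: rest =>
        if length rest >= k then (rest, oldest stack)
        else (oldest :: rest, stack)

(* Apply every pending action, oldest first. *)
fun flush ([] : action list, stack : stack) : stack = stack
  | flush (act :: rest, stack) = flush (rest, act stack)

val (q1, s1) = push 1 ([], []) (fn s => 1 :: s)   (* deferred: s1 is still [] *)
val (q2, s2) = push 1 (q1, s1) (fn s => 2 :: s)   (* oldest applied: s2 = [1] *)
val final = flush (q2, s2)                        (* remaining applied: [2, 1] *)
```

With k = 0 every action is applied as soon as it is pushed, which is the
behaviour one would want from an interactive parser.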
                     62: 
                     63: When a syntax error is encountered, the parser reads some number of tokens
                     64: ahead.  It then uses the queue of deferred actions to check possible error
                     65: recoveries.  IMPORTANT:  Since the lexer is several tokens ahead of the
                     66: execution of semantic actions, the semantic routines CANNOT influence the
                     67: action of the lexer in any significant way.  Example 1:  C's "typedef" is hard
                     68: to compile, as the lexer cannot see new type names in time.  Example 2:  ML's
                     69: "infix" operators must be lexed as plain identifiers and handled in the actions.
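
The idea of trying single-token repairs can be sketched as follows.  This is
only an illustration under stated assumptions: the tok datatype, the function
names, and the wellFormed predicate (a stand-in for "the parser can continue
over the tokens read ahead") are hypothetical and are not taken from ML-Yacc.
The vocabulary passed in contains only tokens without values, matching the
restriction on insertion and substitution noted earlier.

```sml
(* Toy token type; in a real parser these would be the LexValue constructors. *)
datatype tok = NUM of int | PLUS | TIMES

(* Every sequence obtained from toks by one change at position i: deleting
   the token there, inserting a token from vocab before it, or substituting
   a token from vocab for it. *)
fun repairs (vocab : tok list) (toks : tok list) (i : int) : tok list list =
  let
    fun split (0, pre, rest) = (rev pre, rest)
      | split (_, pre, []) = (rev pre, [])
      | split (n, pre, t :: rest) = split (n - 1, t :: pre, rest)
    val (front, back) = split (i, [], toks)
    val insertions = map (fn t => front @ (t :: back)) vocab
  in
    case back of
        [] => insertions
      | _ :: rest =>
          (front @ rest) :: insertions
          @ map (fn t => front @ (t :: rest)) vocab
  end

(* Stand-in validity check: operands separated by single operators. *)
fun wellFormed [NUM _] = true
  | wellFormed (NUM _ :: PLUS :: rest) = wellFormed rest
  | wellFormed (NUM _ :: TIMES :: rest) = wellFormed rest
  | wellFormed _ = false

(* Keep only the candidates accepted by ok. *)
fun keep ok [] = []
  | keep ok (c :: cs) = if ok c then c :: keep ok cs else keep ok cs

(* "1 + + 2" with the error detected at the second PLUS (position 2). *)
val acceptable =
  keep wellFormed (repairs [PLUS, TIMES] [NUM 1, PLUS, PLUS, NUM 2] 2)
```

Here the only surviving repair is the deletion of the extra PLUS.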
                     70: 
                     71: 
                     72: Invoking ML-Yacc
                     73: ----------------
                     74: 
                     75: Load "mlyacc.sml" with "use"; this creates a structure ParseGen.  The function
                     76: ParseGen.parseGen creates a program for a parser from an input
                     77: specification.  It takes a string argument -- the name of the file
                     78: containing the input specification.  The output file name is 
                     79: determined by appending ".sml" to the input file name.
                     80: 
                     81: Using the parser created by ML-Yacc
                     82: -----------------------------------
                     83: 
                     84: Invoking the parser
                     85: -------------------
                     86: 
                     87: After the generated file is loaded, a structure with the name given in the
                     88: %structure statement will exist.  Three of its components are relevant to
                     89: the user:
                     90: 
                     91:         HDR: a structure containing values declared in the "user routines"
                     92:              section of the ML-Yacc specification.
                     93: 
                     94:         LexValue: a structure containing constructors for returning tokens
                     95:                   and their values to the parser from the lexical analyzer.
                     96: 
                     97:         parse: a function which takes a lexing function and a pair of integers
                     98:                and parses the input from the lexing function.
                     99:                The lexing function must return values of type LexValue.V.
                    100:                The pair of integers specifies the defer and lookahead
                    101:                parameters for the error-recovery algorithm.
                    102:                For interactive parsers, (0,0) is suggested; for
                    103:                good error recovery, (10,5) is suggested.
                    104: 
                    105: Creating the lex function
                    106: -------------------------
                    107: 
                    108: The lexing function must have the following form:
                    109: 
                    110: lexer : unit -> LexValue.V
                    111: 
                    112: This lexical-analyzer interface is exactly that provided by ML-Lex, an ML
                    113: version of Lex.  The user should include the statement 
                    114: 'open {structure name}.LexValue'
                    115: in the user routines section of the ML-Lex specification.  This will
                    116: make the components of HDR and the constructors for terminals available to
                    117: the lex actions.  The lex actions should all return these constructors as
                    118: values.
                    119: 
                    120: The names of the constructors are the same as the terminal names used
                    121: in the ML-Yacc specification.   The constructors for those terminals
                    122: that have values take only values with the same type as the terminal,
                    123: of course.  Those for terminals with no values are nullary constructors.
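
To make the correspondence concrete, here is a guess at the shape such a
token structure could take for the declaration
"%term ID of string | NUM of int | PLUS | EOF".  The generated code is not
shown in this document, so the structure below is a hand-written stand-in,
and the describe function is purely illustrative.

```sml
(* Hypothetical stand-in for the generated LexValue structure. *)
structure LexValue =
  struct
    datatype V = ID of string    (* terminal with a string value *)
               | NUM of int      (* terminal with an int value *)
               | PLUS            (* terminals without values are nullary *)
               | EOF
  end

(* Code outside the structure can build and inspect tokens by constructor. *)
fun describe (LexValue.ID s) = "identifier " ^ s
  | describe (LexValue.NUM _) = "number"
  | describe LexValue.PLUS = "plus"
  | describe LexValue.EOF = "end of file"
```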
                    124: 
                    125: A sample ML-Lex specification is given at the end of this document.
                    126: 
                    127: Tying the lexer and parser together
                    128: -----------------------------------
                    129: 
                    130: The use of ML-Lex is suggested but not required.
                    131: The user should create the lexing function using makeLexer, and then
                    132: pass this function to the parse function of the generated structure (called C
                    133: here).  Here is some code which does this.
                    134: 
                    135: fun parse(infile) = 
                    136:      let val in_str = open_in infile
                    137:          val lexer = Mlex.makeLexer(input in_str)
                    138:          val p = C.parse lexer (5,5)
                    139:      in (close_in in_str; p)
                    140:      end
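
Since ML-Lex is not required, any function of type unit -> LexValue.V will
do.  The sketch below builds one from a list of tokens, which can be handy for
testing a grammar.  The LexValue structure here is a small hand-written
stand-in for the generated one, and makeListLexer is a hypothetical helper,
not part of ML-Yacc.

```sml
(* Stand-in for the generated token structure. *)
structure LexValue =
  struct
    datatype V = NUM of int | PLUS | EOF
  end

(* Returns a lexing function that yields each token in turn and then
   EOF on every later call. *)
fun makeListLexer (toks : LexValue.V list) : unit -> LexValue.V =
  let val remaining = ref toks
  in fn () =>
       case !remaining of
           [] => LexValue.EOF
         | t :: rest => (remaining := rest; t)
  end

val lexer = makeListLexer [LexValue.NUM 4, LexValue.PLUS]
val first = lexer ()     (* LexValue.NUM 4 *)
val second = lexer ()    (* LexValue.PLUS *)
val third = lexer ()     (* LexValue.EOF *)
```

Such a function would be passed to parse in place of the ML-Lex lexer, along
with the pair of error-recovery parameters.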
                    141: 
                    142: Grammar specifications
                    143: ----------------------
                    144: 
                    145: The ML-Yacc specification is very similar to a Yacc specification.  The
                    146: specification has the following form:
                    147: 
                    148: {user routines}
                    149: %%
                    150: {declarations}
                    151: %%
                    152: {rules}
                    153: 
                    154: The declarations section contains the following declarations:
                    155: 
                    156: %term           %eof    %left   %nonassoc   %verbose    %subst      %default
                    157: %nonterm        %start  %right  %structure  %prefer     %keyword
                    158: 
                    159: User routines
                    160: -------------
                    161: 
                    162: The user routines section must contain the following:
                    163: 
                    164:         type Lineno = ...
                    165:         val  lineno : Lineno ref = ...
                    166:         val error :  string -> Lineno -> unit
                    167:              which is used to print error messages.
                    168: 
                    169: The Lineno type is some representation of a position in the input file, so that
                    170: error messages can be keyed to the line number (or character position, etc.)
                    171: of the token that "caused" the error.  If this is not necessary, Lineno can be
                    172: just "unit".
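
A minimal user routines section for the case where positions are not tracked
might look like the sketch below.  It assumes only the three required
declarations; sending the message to standard output with print is a choice
made for the example.

```sml
(* No position information is kept, so Lineno is just unit. *)
type Lineno = unit
val lineno : Lineno ref = ref ()

(* The position argument is accepted but ignored. *)
fun error (msg : string) (() : Lineno) : unit = print (msg ^ "\n")
```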
                    173: 
                    174: %verbose
                    175: --------
                    176: 
                    177: Generate a y.output file, which gives a verbose description of the
                    178: tables created from the grammar.
                    179: 
                    180: %structure
                    181: ----------
                    182: 
                    183: The name of the structure in which to place the parser should be specified
                    184: using the statement %structure {structure name}
                    185: 
                    186: Terminals and nonterminals
                    187: --------------------------
                    188: 
                    189: All terminals must be declared in the %term statement.  All
                    190: nonterminals must be declared in the %nonterm statement.  The
                    191: statements both have a form similar to that of an ML constructor
                    192: statement.
                    193: 
                    194: Each symbol may be followed by the phrase "of <type>".
                    195: Multiple symbols must be separated by a bar: '|'.  At least one
                    196: symbol must appear in each statement.
                    197: 
                    198: The user is cautioned not to use ML reserved words for terminal or
                    199: nonterminal names; otherwise the program produced by ML-Yacc will not load
                    200: correctly.  The <type> may be any valid ML type expression.
                    201: 
                    202: A terminal name for the eof terminal must also be supplied.
                    203: 
                    204: The start nonterminal and the eof terminal
                    205: ------------------------------------------
                    206: 
                    207: The eof terminal must be named in the %eof statement.  Suppose the
                    208: eof terminal were EOF.  Then the %eof statement would be
                    209: 
                    210: %eof EOF
                    211: 
                    212: The start nonterminal should be named in the %start statement.  If one
                    213: is not supplied, the lhs of the first rule will be used.
                    214: 
                    215: Precedence
                    216: ----------
                    217: 
                    218: Precedence is specified in the same manner as in Yacc.  The terminals
                    219: are listed after the %left, %right, or %nonassoc statement.  The
                    220: statements are in order of ascending (tighter binding) precedence.
                    221: 
                    222: Precedence operates as it does in Yacc, except for %nonassoc.
                    223: As in Yacc, each rule is assigned the precedence of its rightmost terminal.
                    224: In the case of a shift/reduce conflict the precedence of the terminal
                    225: in the shift and the precedence of the rule in the reduce are compared.
                    226: If the terminal has a higher precedence, the shift is performed.  If the
                    227: rule has a higher precedence, the reduce is performed.  If the terminal
                    228: and the rule have the same precedence, the associativity of the terminal
                    229: is used to resolve the conflict.  If the terminal is left associative,
                    230: we reduce.  If the terminal is right associative, we shift.  If the
                    231: terminal is nonassociative, we print a warning message and shift.
                    232: 
                    233: Thus %nonassoc does not produce a fatal error if the associativity of a
                    234: nonassociative terminal is used to resolve a conflict.  Instead, a warning
                    235: message is printed and a shift is planted, so the nonassociative terminal
                    236: defaults to right associativity.
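
The resolution rules for shift/reduce conflicts can be written down as a
small function.  This is a sketch of the rules as stated in this section, not
code from ML-Yacc; the datatypes, the function name resolve, and the use of
integers for precedence levels (larger meaning tighter binding) are
assumptions made for the example.  The boolean in the result says whether a
message about the nonassociative terminal would be printed.

```sml
datatype assoc = LEFT | RIGHT | NONASSOC
datatype action = SHIFT | REDUCE

(* Compare the precedence of the terminal to be shifted with that of the
   rule to be reduced; fall back on the terminal's associativity on a tie. *)
fun resolve (termPrec : int, termAssoc : assoc) (rulePrec : int)
    : action * bool =
  if termPrec > rulePrec then (SHIFT, false)
  else if rulePrec > termPrec then (REDUCE, false)
  else
    case termAssoc of
        LEFT => (REDUCE, false)
      | RIGHT => (SHIFT, false)
      | NONASSOC => (SHIFT, true)   (* message printed, then shift *)
```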
                    237: 
                    238: Reduce/reduce conflicts are fatal in ML-Yacc, unlike in Yacc.  There is
                    239: no resolution of reduce/reduce conflicts based on rule ordering.
                    240: 
                    241: The %prec tag may be used to alter the precedence for a rule.  It is
                    242: described below under the Rules section.
                    243: 
                    244: Information for an error correction algorithm.
                    245: ----------------------------------------------
                    246: 
                    247: The following keywords allow the user to specify information that may improve
                    248: recovery:
                    249: 
                    250:         %keyword                - a list of tokens which are keywords
                    251:         %prefer                 - a list of tokens which are preferred
                    252:                                   for insertion
                    253:         %subst                  - a list of preferred substitutions for
                    254:                                   certain tokens.
                    255:         %default                - value to be used for inserted tokens
                    256:                                     that carry values. (not yet implemented)
                    257: 
                    258: %keyword and %prefer should each be followed by a list of tokens.
                    259: 
                    260: %subst has the following syntax: 
                    261: 
                    262:    %subst {token} for {token} |  ...   | {token} for {token}
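
For instance, using terminal names from the sample specification in this
document, the declarations might look as follows.  This only illustrates the
syntax: separating the tokens of a %keyword or %prefer list by whitespace is
assumed by analogy with %left, and the particular choice of tokens is
hypothetical.

```
%keyword PRINT
%prefer  PLUS TIMES
%subst   PLUS for SUB | TIMES for DIV
```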
                    263: 
                    264: Rules
                    265: -----
                    266: 
                    267: The rule section consists of a list of rules.
                    268: 
                    269: Each rule has the syntax:
                    270: 
                    271: {lhs nonterminal} : {rhs symbol list}   ({value of same type as nonterminal,
                    272:                                          or any type if the nonterminal has
                    273:                                          none }) 
                    274: 
                    275:  or
                    276: 
                    277:  {lhs nonterminal} : {rhs symbol list} %prec {terminal} ( ... value ...)
                    278: 
                    279: The second form gives the rule the same precedence as the terminal.
                    280: 
                    281: The | may be used to separate multiple rhs's with the same lhs.
                    282: 
                    283: A null rhs may be specified simply by having an empty rhs symbol list.
                    284: 
                    285: The ( ... value ... ) part is not optional.  It may be empty, though, if
                    286: the lhs nonterminal has no type associated with it.
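
A common use of %prec in Yacc is unary minus, and the same idiom would
presumably carry over.  The fragment below is a sketch under that assumption:
UMINUS is a hypothetical terminal that the lexer never returns and that
exists only to carry a precedence, and whether version 1.0 accepts such a
terminal has not been verified here.

```
%term ... SUB | UMINUS ...
%nonterm ... EXP of int ...
%left SUB
%left UMINUS
...
%%

EXP : SUB EXP %prec UMINUS   (0 - EXP1)
```

Because UMINUS is declared after SUB, the rule binds more tightly than
binary subtraction.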
                    287: 
                    288: Values
                    289: ------
                    290: 
                    291: The value for a symbol on the rhs of a rule is in a variable
                    292: {symbol}N.  N is the number of occurrences of the symbol in the rhs,
                    293: up to and including the symbol.  Suppose we had a specification of
                    294: the form:
                    295: 
                    296: %term ... PLUS ...
                    297: %nonterm ... EXP of int ...
                    298: ...
                    299: %%
                    300: 
                    301: EXP : EXP PLUS EXP 
                    302: 
                    303: Then the action could contain (EXP1 + EXP2).  EXP1 is the value of the
                    304: first occurrence of EXP on the rhs, while EXP2 is the value of the second
                    305: occurrence on the rhs.
                    306: 
                    307: If a symbol has no type associated with it, it has no value associated
                    308: with it.  Attempting to use PLUS1, in the above example, would result
                    309: later in a compilation error.
                    310: 
                    311: Any value returned by a rule whose left hand side has no type associated with
                    312: it is ignored.  The ML code associated with such rules may return any kind of
                    313: value; it will be executed for possible side-effects.
                    314: 
                    315: If a terminal or nonterminal occurs only once on the rhs of a rule, its
                    316: value is also in {symbol}, as well as in {symbol}1.
                    317: 
                    318: Bugs
                    319: ----
                    320: There should be a better way for semantic-action routines to print out errors
                    321: and get the appropriate range of line numbers in the message automatically.
                    322: 
                    323: Functors should be used to make the lexer more independent of the parser.
                    324: Speed should be improved in future versions.
                    325: 
                    326: Sample specification
                    327: --------------------
                    328: (* sample.grm *)
                    329: type Lineno = int
                    330: val lineno = ref 1
                    331: fun error s l = output std_out 
                    332:         ("line " ^ makestring (l:int) ^ ":" ^ s ^ "\n")
                    333: fun lookup s = ordof(s,0) - ord("a")+1
                    334: %%
                    335: %structure Calc
                    336: %eof   EOF
                    337: %start START
                    338: 
                    339: %left SUB PLUS
                    340: %left TIMES DIV
                    341: %right CARAT
                    342: 
                    343: %term ID of string | NUM of int | PLUS | TIMES | PRINT | EOS | EOF |
                    344:         CARAT | DIV | SUB
                    345: %nonterm EXP of int | START | STMT | STMT_LIST
                    346: 
                    347: %%
                    348:   START : STMT_LIST     ()
                    349:   STMT_LIST : STMT_LIST STMT EOS ()
                    350:   STMT_LIST : ()
                    351:   STMT : PRINT EXP      (print EXP; print "\n"; flush_out std_out)
                    352:   STMT : EXP            ()
                    353:   EXP : NUM             (NUM)
                    354:       | ID              (lookup ID)
                    355:       | EXP PLUS EXP    (EXP1+EXP2)
                    356:       | EXP TIMES EXP   (EXP1*EXP2)
                    357:       | EXP DIV EXP     (EXP1 div EXP2)
                    358:       | EXP SUB EXP     (EXP1-EXP2)
                    359:       | EXP CARAT EXP   (let fun e (m,0) = 1
                    360:                                 | e (m,i) = m*e(m,i-1)
                    361:                          in e (EXP1,EXP2)       
                    362:                          end)
                    363: 
                    364: Sample ML-Lex specification
                    365: ---------------------------
                    366: (* sample.lex *)
                    367: open Calc.LexValue
                    368: type lexresult=V
                    369: val eof = fn () => EOF
                    370: %%
                    371: alpha=[A-Za-z];
                    372: digit=[0-9];
                    373: ws = [\ \t\n];
                    374: %%
                    375: {ws}+    => (lex());
                    376: {digit}+ => (NUM (revfold (fn (a,r) => ord(a)-ord("0")+10*r) (explode yytext) 0));
                    377: "+"      => (PLUS);
                    378: "*"      => (TIMES);
                    379: ";"      => (EOS);
                    380: {alpha}+ => (if yytext="print" then PRINT else ID yytext);
                    381: "-"      => (SUB);
                    382: "^"      => (CARAT);
                    383: "/"      => (DIV);
                    384: .        => (Calc.HDR.error ("ignoring bad character " ^ yytext) (!Calc.HDR.lineno); lex());
                    385: 
                    386: 
                    387: Sample "main" module
                    388: --------------------
                    389: (* sample.sml *)
                    390: structure Sample =
                    391:    struct
                    392:       fun run filename = 
                    393:             (* more suitable for non-interactive use *)
                    394:           let val in_str = open_in filename
                    395:               val lexer =  Mlex.makeLexer (input in_str)
                    396:               val p = (Calc.HDR.lineno := 0;
                    397:                        Calc.parse lexer (5,5))
                    398:            in (close_in in_str; p)
                    399:            end
                    400:     
                    401:       fun run_std_in () =
                    402:             (* more suitable for interactive use *)
                    403:         let val lexer = Mlex.makeLexer (fn _ => input_line std_in)
                    404:             val p = (Calc.HDR.lineno := 0;
                    405:                        Calc.parse lexer (0,0))
                    406:          in p
                    407:         end
                    408:    end
                    409: 
                    410: Sample input
                    411: ------------
                    412: (* sample.input, contains an intentional syntax error *)
                    413: print 4+2;
                    414: a+b
                    415: print b*c;
                    416: 
                    417: 
                    418: How to try the sample program
                    419: -----------------------------
                    420: 
                    421: % sml
                    422: - use "mlyacc.sml";  (* load the parser generator *)
                    423: - use "lexgen.sml";  (* load the lexical analyzer generator *)
                    424: - open LexGen ParseGen;
                    425: - exportML "yacclex"; (* save the image for later use *)
                    426: 
                    427: - lexGen "sample.lex";  (* creates the file sample.lex.sml *)
                    428: - parseGen "sample.grm"; (* creates the file sample.grm.sml *)
                    429: - ^D            (* exiting here is optional, of course *)
                    430: 
                    431: % sml
                    432: - use "sample.grm.sml";     (* compile the parser *)
                    433: - use "sample.lex.sml";     (* compile the lexer *)
                    434: - use "sample.sml";         (* compile the main module *)
                    435: - Sample.run "sample.input"; (* run the sample program *)
                    436: