researchv10no/cmd/sml/lib/mlyacc/mlyacc.doc - view

File: [Research Unix] / researchv10no / cmd / sml / lib / mlyacc / mlyacc.doc
Revision 1.1.1.1 (vendor branch): download - view: text, annotated - select for diffs
Tue Apr 24 17:21:34 2018 UTC (8 years, 1 month ago) by root
Branches: belllabs, MAIN
CVS tags: researchv10, HEAD

researchv10 Norman

ML-YACC, version 1.0 Preliminary Documentation for Preliminary Version David R. Tarditi Andrew W. Appel Department of Computer Science Princeton University Princeton, NJ 08544 February 1, 1989 (c) 1989 Andrew W. Appel, David R. Tarditi This software comes with ABSOLUTELY NO WARRANTY. This software is subject only to the PRINCETON STANDARD ML SOFTWARE LIBRARY COPYRIGHT NOTICE, LICENSE AND DISCLAIMER, (in the file "COPYRIGHT", distributed with this software). You may copy and distribute this software; see the COPYRIGHT NOTICE for details and restrictions. Description ----------- This is a preliminary guide to using ML-Yacc. It is not complete documentation. It tells how to invoke ML-Yacc, and the syntax of an ML-Yacc specification. The syntax description assumes the user knows how to use Yacc, and notes the differences between ML-Yacc and Yacc. Note that version 2.0 will be released at the end of 1989; it has a slightly different interface and much improved error recovery. ML-Yacc is Yacc-like parser generator for Standard ML. It generates parsers for LALR languages, like Yacc, and its syntax is very similar to that of Yacc. It handles syntax errors differently from Yacc. The parser generated by ML-Yacc will attempt to automatically recover from a syntax error by making a single token insertion, deletion, or substitution. At the moment, only tokens without values can be inserted or substituted. A future version will also insert tokens with values, by providing a mechanism for specifying code to be evaluated for each token that will provided a dummy value. Syntax Error Recovery Method. ----------------------------- The method used is described in 'A Practical Method for LR and LL Syntactic Error Diagnosis and Recovery', by M. Burke and G. Fisher, ACM Transactions on Programming Languages and Systems, Vol. 9, No. 2, April 1987, pp. 164-197. The partial, deferred method discussed in the article has been implemented. This method defers reductions for some number of shifted tokens. The deferred reductions are kept in a queue, and when an element is pulled from the queue, the reductions are then applied to the value stack. When a syntax error is encountered, the parser reads some number of tokens ahead. It then uses the queue of deferred actions to check possible error recoveries. IMPORTANT: Since the lexer is several tokens ahead of the execution of semantic actions, the semantic routines CANNOT influence the action of the lexer in any significant way. Example 1: it is hard to imagine how to compile "typedef" in C. Example 2: the "infix" operators of ML must be treated as ordinary identifiers and handled specially by the semantic actions. Invoking ML-Yacc ---------------- Use "mlyacc.sml"; this will create a structure ParseGen. The function ParseGen.parseGen creates a program for a parser from an input specification. It takes a string argument -- the name of the file containing the input specification. The output file name is determined by appending ".sml" to the input file name. Using the parser created by ML-Yacc ----------------------------------- Invoking the parser ------------------- After following the commands above, a structure whose name was specified by the user will exist. Three of the components are relevent to the user: HDR: a structure containing values declared in the "user routines" section of the ML-Yacc specification. LexValue: a structure containing constructors for returning tokens and their values to the parser from the lexical analyzer. parse: a function which takes a lexing function and a pair of integers and parses the input from the lexing functions. The lexing function must returns values of type LexValue.V The pair of integers specifies the defer and lookahead parameters for the error-recovery algorithm. For interactive parsers, (0,0) is suggested; for good error recovery, (10,5) is suggested. Creating the lex function ------------------------- The lexing function must have the following form: lexer : unit -> LexValue.V This lexical-analyzer interface is exactly that provided by ML-Lex, an ML version of Lex. The user should include the statement 'open {structure name}.LexValue' in the user routines section of his ML-Lex specification. This will make the components of HDR and the constructors for terminals available to the lex actions. The lex actions should all return these constructors as values. The names of the constructors are the same as the terminal names used in the ML-Yacc specification. The constructors for those terminals that have values take only values with the same type as the terminal, of course. Those for terminals with no values are nullary constructors. A sample ML-Lex specification is given at the end of this document. Tying the lexer and parser together ----------------------------------- The use of ML-Lex is suggested but not required. The user should create the lexing function using makeLexer, and then pass this function to the function C.parse. Here is some code which does this. fun parse(infile) = let val in_str = open_in infile val lexer = mlex.makeLexer(input in_str) val p = C.parse lexer (5,5) in (close_in in_str; p) end Grammar specifications ---------------------- The ML-Yacc specification is very similar to a Yacc specification. The specification has the following form: {user routines} %% {declarations} %% {rules} The declarations section contains the following declarations: %term %eof %left %nonassoc %verbose %subst %default %nonterm %start %right %structure %prefer %keyword User routines ------------- The user routines section must contain the following: type Lineno = ... val lineno : Lineno ref = ... val error : string -> Lineno -> unit which is used to print error messages. The Lineno type is some representation of a position in the input file, so that error messages can be keyed to the line number (or character position, etc.) of the token that "caused" the error. If this is not necessary, Lineno can be just "unit". %verbose -------- Generate a y.output file, which gives a verbose description of the tables created from the grammar. %structure ---------- The name of the structure in which to place the parser should be specifed using the statement %structure {structure name} Terminals and nonterminals -------------------------- All terminals must be declared in the %term statement. All nonterminals must be declared in the %nonterm statement. The statements both have a form similar to that of an ML constructor statement. Each symbol may be followed by the phrase "of <type>". Multiple symbols must be separated by a bar: '|'. At least one symbol must appear in each statement. The user is cautioned not to use ML reserved words for terminal or nonterminal names. The program produced by ML-Yacc will not load correctly. The <type> may be any valid ML type expression. A terminal name for the eof terminal must also be supplied. The start nonterminal and the eof terminal ------------------------------------------ The eof terminal must be named in the %eof statement. Suppose the eof terminal were EOF. Then the %eof statement would be %eof EOF The start terminal should be named in the %start statement. If one is not supplied, the lhs of the first rule will be used. Precedence ---------- Precedence is specified in the same manner as yacc. The terminals are listed after the %left, %right, or %nonassoc statement. The statements are in order of ascending (tighter binding) precedence. Precedence operates like it does in yacc, except for %nonassoc. Like YACC, each rule is assigned the precedence of its rightmost terminal. In the case of a shift/reduce conflict the precedence of the terminal in the shift and the precedence of the rule in the reduce are compared. If the terminal has a higher precedence, the shift is performed. If the rule has a higher precedence, the reduce is performed. If the terminal and the rule have the same precedence, the associativity of the terminal is used to resolve the conflict. If the terminal is left associative, we reduce. If the terminal is right associative, we shift. If the terminal is nonassociative we print an error message and shift. Thus %nonassoc does not produce a fatal error if the associativity of a nonassociative terminal is used to resolve a conflict. A warning message is printed, though, about this, and a shift is planted. The shift causes the nonassociative terminal to default to right associativity. Reduce/reduce conflicts are fatal in ML-Yacc, unlike in yacc. There is no resolution of reduce/reduce conflicts based on rule ordering. The %prec tag may be used to alter the precedence for a rule. It is described below under the Rules section. Information for an error correction algorithm. ---------------------------------------------- The following keywords allow the user to specify information that may improve recovery: %keyword - a list of tokens which are keywords %prefer - a list of tokens which are preferred for insertion %subst - a list of preferred substitions for certain tokens. %default - value to be used for inserted tokens that carry values. (not yet implemented) %keyword and %prefer should each be followed by a list of tokens. %subst has the following syntax: %subst {token} for {token} | ... | {token} for {token} Rules ----- The rule section consists of a list of rules. Each rule has the syntax: {lhs nonterminal} : {rhs symbol list} ({value of same type as nonterminal, or any type if the nonterminal has none }) or {lhs nonterminal} : {rhs symbol list} %prec {terminal} ( ... value ...) The second form gives the rule the same precedence as the terminal. The | may be used to separate multiple rhs's with the same lhs. A null rhs may be specified by simply by having an empty rhs symbol list. The ( ... value ... ) part is not optional. It may be empty, though, if the lhs nonterminal has no type associated with it. Values ------ The value for a symbol on the rhs of a rule is in a variable {symbol}N. N is the number of occurrences of the symbol in the rhs, up to and including the symbol. Suppose we had a specification of the form: %term ... PLUS ... %nonterm ... EXP of int ... ... %% EXP : EXP PLUS EXP Then the action could contain (EXP1 + EXP2). EXP1 is the value of the first occurrence of EXP on the rhs, while EXP2 is the value of the second occurrence on the rhs. If a symbol has no type associated with it, it has no value associated with it. Attempting to use PLUS1, in the above example, would result later in a compilation error. Any value returned by a rule whose left hand side has no type associated with it is ignored. The ML code associated with such rules may return any kind of value; it will be executed for possible side-effects. If a terminal or nonterminal occurs only once on the rhs of a rule, its value is also in {symbol}, as well as in {symbol}1 Bugs ---- There should be a better way for semantic-action routines to print out errors and get the appropriate range of line numbers in the message automatically. Functors should be used to make the lexer more independent of the parser. Speed should be improved in future versions. Sample specification -------------------- (* sample.grm *) type Lineno = int val lineno = ref 1 fun error s l = output std_out ("line " ^ makestring (l:int) ^ ":" ^ s ^ "\n") fun lookup s = ordof(s,0) - ord("a")+1 %% %structure Calc %eof EOF %start START %left SUB PLUS %left TIMES DIV %right CARAT %term ID of string | NUM of int | PLUS | TIMES | PRINT | EOS | EOF | CARAT | DIV | SUB %nonterm EXP of int | START | STMT | STMT_LIST %% START : STMT_LIST () STMT_LIST : STMT_LIST STMT EOS () STMT_LIST : () STMT : PRINT EXP (print EXP; print "\n"; flush_out std_out) STMT : EXP () EXP : NUM (NUM) | ID (lookup ID) | EXP PLUS EXP (EXP1+EXP2) | EXP TIMES EXP (EXP1*EXP2) | EXP DIV EXP (EXP1 div EXP2) | EXP SUB EXP (EXP1-EXP2) | EXP CARAT EXP (let fun e (m,0) = 1 | e (m,i) = m*e(m,i-1) in e (EXP1,EXP2) end) Sample ML-Lex specification --------------------------- (* sample.lex *) open Calc.LexValue type lexresult=V val eof = fn () => EOF %% alpha=[A-Za-z]; digit=[0-9]; ws = [\ \t\n]; %% {ws}+ => (lex()); {digit}+ => (NUM (revfold (fn (a,r) => ord(a)-ord("0")+10*r) (explode yytext) 0)); "+" => (PLUS); "*" => (TIMES); ";" => (EOS); {alpha}+ => (if yytext="print" then PRINT else ID yytext); "-" => (SUB); "^" => (CARAT); "/" => (DIV); . => (Calc.HDR.error ("ignoring bad character " ^ yytext); lex()); Sample "main" module -------------------- (* sample.sml *) structure Sample = struct fun run filename = (* more suitable for non-interactive use *) let val in_str = open_in filename val lexer = Mlex.makeLexer (input in_str) val p = (Calc.HDR.lineno := 0; Calc.parse lexer (5,5)) in (close_in in_str; p) end fun run_std_in () = (* more suitable for interactive use *) let val lexer = Mlex.makeLexer (fn _ => input_line std_in) val p = (Calc.HDR.lineno := 0; Calc.parse lexer (0,0)) in p end end Sample input ------------ (* sample.input, contains an intentional syntax error *) print 4+2; a+b print b*c; How to try the sample program ---------------------------- % sml - use "mlyacc.sml"; (* load the parser generator *) - use "lexgen.sml"; (* load the lexical analyzer generator *) - open LexGen ParseGen; - exportML "yacclex"; (* save the image for later use *) - lexGen "sample.lex"; (* creates the file sample.lex.sml *) - parseGen "sample.grm"; (* creates the file sample.grm.sml *) - ^D (* exiting here is optional, of course *) % sml - use "sample.grm.sml"; (* compile the parser *) - use "sample.lex.sml"; (* compile the lexer *) - use "sample.sml"; (* compile the main module *) - Sample.run "sample.input"; (* run the sample program *)

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.