|
|
researchv10 Norman
ML-YACC, version 1.0
Preliminary Documentation
for Preliminary Version
David R. Tarditi
Andrew W. Appel
Department of Computer Science
Princeton University
Princeton, NJ 08544
February 1, 1989
(c) 1989 Andrew W. Appel, David R. Tarditi
This software comes with ABSOLUTELY NO WARRANTY.
This software is subject only to the PRINCETON STANDARD ML SOFTWARE LIBRARY
COPYRIGHT NOTICE, LICENSE AND DISCLAIMER, (in the file "COPYRIGHT",
distributed with this software). You may copy and distribute this software;
see the COPYRIGHT NOTICE for details and restrictions.
Description
-----------
This is a preliminary guide to using ML-Yacc. It is not complete documentation.
It tells how to invoke ML-Yacc, and the syntax of an ML-Yacc specification.
The syntax description assumes the user knows how to use Yacc, and notes the
differences between ML-Yacc and Yacc.
Note that version 2.0 will be released at the end of 1989; it has a slightly
different interface and much improved error recovery.
ML-Yacc is Yacc-like parser generator for Standard ML. It generates
parsers for LALR languages, like Yacc, and its syntax is very similar to
that of Yacc.
It handles syntax errors differently from Yacc. The parser
generated by ML-Yacc will attempt to automatically recover from a
syntax error by making a single token insertion, deletion, or substitution.
At the moment, only tokens without values can be inserted or substituted.
A future version will also insert tokens with values, by providing a
mechanism for specifying code to be evaluated for each token that will
provided a dummy value.
Syntax Error Recovery Method.
-----------------------------
The method used is described in
'A Practical Method for LR and LL Syntactic Error Diagnosis and
Recovery', by M. Burke and G. Fisher, ACM Transactions on
Programming Languages and Systems, Vol. 9, No. 2, April 1987,
pp. 164-197.
The partial, deferred method discussed in the article has been
implemented.
This method defers reductions for some number of shifted tokens. The
deferred reductions are kept in a queue, and when an element is pulled
from the queue, the reductions are then applied to the value stack.
When a syntax error is encountered, the parser reads some number of tokens
ahead. It then uses the queue of deferred actions to check possible error
recoveries. IMPORTANT: Since the lexer is several tokens ahead of the
execution of semantic actions, the semantic routines CANNOT influence the
action of the lexer in any significant way. Example 1: it is hard to imagine
how to compile "typedef" in C. Example 2: the "infix" operators of ML must be
treated as ordinary identifiers and handled specially by the semantic actions.
Invoking ML-Yacc
----------------
Use "mlyacc.sml"; this will create a structure ParseGen. The function
ParseGen.parseGen creates a program for a parser from an input
specification. It takes a string argument -- the name of the file
containing the input specification. The output file name is
determined by appending ".sml" to the input file name.
Using the parser created by ML-Yacc
-----------------------------------
Invoking the parser
-------------------
After following the commands above, a structure whose name was
specified by the user will exist. Three of the components are relevent to
the user:
HDR: a structure containing values declared in the "user routines"
section of the ML-Yacc specification.
LexValue: a structure containing constructors for returning tokens
and their values to the parser from the lexical analyzer.
parse: a function which takes a lexing function and a pair of integers
and parses the input from the lexing functions.
The lexing function must returns values of type LexValue.V
The pair of integers specifies the defer and lookahead
parameters for the error-recovery algorithm.
For interactive parsers, (0,0) is suggested; for
good error recovery, (10,5) is suggested.
Creating the lex function
-------------------------
The lexing function must have the following form:
lexer : unit -> LexValue.V
This lexical-analyzer interface is exactly that provided by ML-Lex, an ML
version of Lex. The user should include the statement
'open {structure name}.LexValue'
in the user routines section of his ML-Lex specification. This will
make the components of HDR and the constructors for terminals available to
the lex actions. The lex actions should all return these constructors as
values.
The names of the constructors are the same as the terminal names used
in the ML-Yacc specification. The constructors for those terminals
that have values take only values with the same type as the terminal,
of course. Those for terminals with no values are nullary constructors.
A sample ML-Lex specification is given at the end of this document.
Tying the lexer and parser together
-----------------------------------
The use of ML-Lex is suggested but not required.
The user should create the lexing function using makeLexer, and then
pass this function to the function C.parse. Here is some code which
does this.
fun parse(infile) =
let val in_str = open_in infile
val lexer = mlex.makeLexer(input in_str)
val p = C.parse lexer (5,5)
in (close_in in_str; p)
end
Grammar specifications
----------------------
The ML-Yacc specification is very similar to a Yacc specification. The
specification has the following form:
{user routines}
%%
{declarations}
%%
{rules}
The declarations section contains the following declarations:
%term %eof %left %nonassoc %verbose %subst %default
%nonterm %start %right %structure %prefer %keyword
User routines
-------------
The user routines section must contain the following:
type Lineno = ...
val lineno : Lineno ref = ...
val error : string -> Lineno -> unit
which is used to print error messages.
The Lineno type is some representation of a position in the input file, so that
error messages can be keyed to the line number (or character position, etc.)
of the token that "caused" the error. If this is not necessary, Lineno can be
just "unit".
%verbose
--------
Generate a y.output file, which gives a verbose description of the
tables created from the grammar.
%structure
----------
The name of the structure in which to place the parser should be specifed
using the statement %structure {structure name}
Terminals and nonterminals
--------------------------
All terminals must be declared in the %term statement. All
nonterminals must be declared in the %nonterm statement. The
statements both have a form similar to that of an ML constructor
statement.
Each symbol may be followed by the phrase "of <type>".
Multiple symbols must be separated by a bar: '|'. At least one
symbol must appear in each statement.
The user is cautioned not to use ML reserved words for terminal or
nonterminal names. The program produced by ML-Yacc will not load correctly.
The <type> may be any valid ML type expression.
A terminal name for the eof terminal must also be supplied.
The start nonterminal and the eof terminal
------------------------------------------
The eof terminal must be named in the %eof statement. Suppose the
eof terminal were EOF. Then the %eof statement would be
%eof EOF
The start terminal should be named in the %start statement. If one
is not supplied, the lhs of the first rule will be used.
Precedence
----------
Precedence is specified in the same manner as yacc. The terminals
are listed after the %left, %right, or %nonassoc statement. The
statements are in order of ascending (tighter binding) precedence.
Precedence operates like it does in yacc, except for %nonassoc.
Like YACC, each rule is assigned the precedence of its rightmost terminal.
In the case of a shift/reduce conflict the precedence of the terminal
in the shift and the precedence of the rule in the reduce are compared.
If the terminal has a higher precedence, the shift is performed. If the
rule has a higher precedence, the reduce is performed. If the terminal
and the rule have the same precedence, the associativity of the terminal
is used to resolve the conflict. If the terminal is left associative,
we reduce. If the terminal is right associative, we shift. If the
terminal is nonassociative we print an error message and shift.
Thus %nonassoc does not produce a fatal error if the associativity of a
nonassociative terminal is used to resolve a conflict. A warning message is
printed, though, about this, and a shift is planted. The shift causes the
nonassociative terminal to default to right associativity.
Reduce/reduce conflicts are fatal in ML-Yacc, unlike in yacc. There is
no resolution of reduce/reduce conflicts based on rule ordering.
The %prec tag may be used to alter the precedence for a rule. It is
described below under the Rules section.
Information for an error correction algorithm.
----------------------------------------------
The following keywords allow the user to specify information that may improve
recovery:
%keyword - a list of tokens which are keywords
%prefer - a list of tokens which are preferred
for insertion
%subst - a list of preferred substitions for
certain tokens.
%default - value to be used for inserted tokens
that carry values. (not yet implemented)
%keyword and %prefer should each be followed by a list of tokens.
%subst has the following syntax:
%subst {token} for {token} | ... | {token} for {token}
Rules
-----
The rule section consists of a list of rules.
Each rule has the syntax:
{lhs nonterminal} : {rhs symbol list} ({value of same type as nonterminal,
or any type if the nonterminal has
none })
or
{lhs nonterminal} : {rhs symbol list} %prec {terminal} ( ... value ...)
The second form gives the rule the same precedence as the terminal.
The | may be used to separate multiple rhs's with the same lhs.
A null rhs may be specified by simply by having an empty rhs symbol list.
The ( ... value ... ) part is not optional. It may be empty, though, if
the lhs nonterminal has no type associated with it.
Values
------
The value for a symbol on the rhs of a rule is in a variable
{symbol}N. N is the number of occurrences of the symbol in the rhs,
up to and including the symbol. Suppose we had a specification of
the form:
%term ... PLUS ...
%nonterm ... EXP of int ...
...
%%
EXP : EXP PLUS EXP
Then the action could contain (EXP1 + EXP2). EXP1 is the value of the
first occurrence of EXP on the rhs, while EXP2 is the value of the second
occurrence on the rhs.
If a symbol has no type associated with it, it has no value associated
with it. Attempting to use PLUS1, in the above example, would result
later in a compilation error.
Any value returned by a rule whose left hand side has no type associated with
it is ignored. The ML code associated with such rules may return any kind of
value; it will be executed for possible side-effects.
If a terminal or nonterminal occurs only once on the rhs of a rule, its
value is also in {symbol}, as well as in {symbol}1
Bugs
----
There should be a better way for semantic-action routines to print out errors
and get the appropriate range of line numbers in the message automatically.
Functors should be used to make the lexer more independent of the parser.
Speed should be improved in future versions.
Sample specification
--------------------
(* sample.grm *)
type Lineno = int
val lineno = ref 1
fun error s l = output std_out
("line " ^ makestring (l:int) ^ ":" ^ s ^ "\n")
fun lookup s = ordof(s,0) - ord("a")+1
%%
%structure Calc
%eof EOF
%start START
%left SUB PLUS
%left TIMES DIV
%right CARAT
%term ID of string | NUM of int | PLUS | TIMES | PRINT | EOS | EOF |
CARAT | DIV | SUB
%nonterm EXP of int | START | STMT | STMT_LIST
%%
START : STMT_LIST ()
STMT_LIST : STMT_LIST STMT EOS ()
STMT_LIST : ()
STMT : PRINT EXP (print EXP; print "\n"; flush_out std_out)
STMT : EXP ()
EXP : NUM (NUM)
| ID (lookup ID)
| EXP PLUS EXP (EXP1+EXP2)
| EXP TIMES EXP (EXP1*EXP2)
| EXP DIV EXP (EXP1 div EXP2)
| EXP SUB EXP (EXP1-EXP2)
| EXP CARAT EXP (let fun e (m,0) = 1
| e (m,i) = m*e(m,i-1)
in e (EXP1,EXP2)
end)
Sample ML-Lex specification
---------------------------
(* sample.lex *)
open Calc.LexValue
type lexresult=V
val eof = fn () => EOF
%%
alpha=[A-Za-z];
digit=[0-9];
ws = [\ \t\n];
%%
{ws}+ => (lex());
{digit}+ => (NUM (revfold (fn (a,r) => ord(a)-ord("0")+10*r) (explode yytext) 0));
"+" => (PLUS);
"*" => (TIMES);
";" => (EOS);
{alpha}+ => (if yytext="print" then PRINT else ID yytext);
"-" => (SUB);
"^" => (CARAT);
"/" => (DIV);
. => (Calc.HDR.error ("ignoring bad character " ^ yytext); lex());
Sample "main" module
--------------------
(* sample.sml *)
structure Sample =
struct
fun run filename =
(* more suitable for non-interactive use *)
let val in_str = open_in filename
val lexer = Mlex.makeLexer (input in_str)
val p = (Calc.HDR.lineno := 0;
Calc.parse lexer (5,5))
in (close_in in_str; p)
end
fun run_std_in () =
(* more suitable for interactive use *)
let val lexer = Mlex.makeLexer (fn _ => input_line std_in)
val p = (Calc.HDR.lineno := 0;
Calc.parse lexer (0,0))
in p
end
end
Sample input
------------
(* sample.input, contains an intentional syntax error *)
print 4+2;
a+b
print b*c;
How to try the sample program
----------------------------
% sml
- use "mlyacc.sml"; (* load the parser generator *)
- use "lexgen.sml"; (* load the lexical analyzer generator *)
- open LexGen ParseGen;
- exportML "yacclex"; (* save the image for later use *)
- lexGen "sample.lex"; (* creates the file sample.lex.sml *)
- parseGen "sample.grm"; (* creates the file sample.grm.sml *)
- ^D (* exiting here is optional, of course *)
% sml
- use "sample.grm.sml"; (* compile the parser *)
- use "sample.lex.sml"; (* compile the lexer *)
- use "sample.sml"; (* compile the main module *)
- Sample.run "sample.input"; (* run the sample program *)
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.