|
|
1.1 root 1:
2: ML-YACC, version 1.0
3:
4: Preliminary Documentation
5: for Preliminary Version
6:
7: David R. Tarditi
8: Andrew W. Appel
9:
10: Department of Computer Science
11: Princeton University
12: Princeton, NJ 08544
13:
14: February 1, 1989
15:
16: (c) 1989 Andrew W. Appel, David R. Tarditi
17: This software comes with ABSOLUTELY NO WARRANTY.
18: This software is subject only to the PRINCETON STANDARD ML SOFTWARE LIBRARY
19: COPYRIGHT NOTICE, LICENSE AND DISCLAIMER, (in the file "COPYRIGHT",
20: distributed with this software). You may copy and distribute this software;
21: see the COPYRIGHT NOTICE for details and restrictions.
22:
23: Description
24: -----------
25:
26: This is a preliminary guide to using ML-Yacc. It is not complete documentation.
27: It tells how to invoke ML-Yacc, and the syntax of an ML-Yacc specification.
28: The syntax description assumes the user knows how to use Yacc, and notes the
29: differences between ML-Yacc and Yacc.
30:
31: Note that version 2.0 will be released at the end of 1989; it has a slightly
32: different interface and much improved error recovery.
33:
34: ML-Yacc is Yacc-like parser generator for Standard ML. It generates
35: parsers for LALR languages, like Yacc, and its syntax is very similar to
36: that of Yacc.
37:
38: It handles syntax errors differently from Yacc. The parser
39: generated by ML-Yacc will attempt to automatically recover from a
40: syntax error by making a single token insertion, deletion, or substitution.
41: At the moment, only tokens without values can be inserted or substituted.
42: A future version will also insert tokens with values, by providing a
43: mechanism for specifying code to be evaluated for each token that will
44: provided a dummy value.
45:
46: Syntax Error Recovery Method.
47: -----------------------------
48:
49: The method used is described in
50:
51: 'A Practical Method for LR and LL Syntactic Error Diagnosis and
52: Recovery', by M. Burke and G. Fisher, ACM Transactions on
53: Programming Languages and Systems, Vol. 9, No. 2, April 1987,
54: pp. 164-197.
55:
56: The partial, deferred method discussed in the article has been
57: implemented.
58:
59: This method defers reductions for some number of shifted tokens. The
60: deferred reductions are kept in a queue, and when an element is pulled
61: from the queue, the reductions are then applied to the value stack.
62:
63: When a syntax error is encountered, the parser reads some number of tokens
64: ahead. It then uses the queue of deferred actions to check possible error
65: recoveries. IMPORTANT: Since the lexer is several tokens ahead of the
66: execution of semantic actions, the semantic routines CANNOT influence the
67: action of the lexer in any significant way. Example 1: it is hard to imagine
68: how to compile "typedef" in C. Example 2: the "infix" operators of ML must be
69: treated as ordinary identifiers and handled specially by the semantic actions.
70:
71:
72: Invoking ML-Yacc
73: ----------------
74:
75: Use "mlyacc.sml"; this will create a structure ParseGen. The function
76: ParseGen.parseGen creates a program for a parser from an input
77: specification. It takes a string argument -- the name of the file
78: containing the input specification. The output file name is
79: determined by appending ".sml" to the input file name.
80:
81: Using the parser created by ML-Yacc
82: -----------------------------------
83:
84: Invoking the parser
85: -------------------
86:
87: After following the commands above, a structure whose name was
88: specified by the user will exist. Three of the components are relevent to
89: the user:
90:
91: HDR: a structure containing values declared in the "user routines"
92: section of the ML-Yacc specification.
93:
94: LexValue: a structure containing constructors for returning tokens
95: and their values to the parser from the lexical analyzer.
96:
97: parse: a function which takes a lexing function and a pair of integers
98: and parses the input from the lexing functions.
99: The lexing function must returns values of type LexValue.V
100: The pair of integers specifies the defer and lookahead
101: parameters for the error-recovery algorithm.
102: For interactive parsers, (0,0) is suggested; for
103: good error recovery, (10,5) is suggested.
104:
105: Creating the lex function
106: -------------------------
107:
108: The lexing function must have the following form:
109:
110: lexer : unit -> LexValue.V
111:
112: This lexical-analyzer interface is exactly that provided by ML-Lex, an ML
113: version of Lex. The user should include the statement
114: 'open {structure name}.LexValue'
115: in the user routines section of his ML-Lex specification. This will
116: make the components of HDR and the constructors for terminals available to
117: the lex actions. The lex actions should all return these constructors as
118: values.
119:
120: The names of the constructors are the same as the terminal names used
121: in the ML-Yacc specification. The constructors for those terminals
122: that have values take only values with the same type as the terminal,
123: of course. Those for terminals with no values are nullary constructors.
124:
125: A sample ML-Lex specification is given at the end of this document.
126:
127: Tying the lexer and parser together
128: -----------------------------------
129:
130: The use of ML-Lex is suggested but not required.
131: The user should create the lexing function using makeLexer, and then
132: pass this function to the function C.parse. Here is some code which
133: does this.
134:
135: fun parse(infile) =
136: let val in_str = open_in infile
137: val lexer = mlex.makeLexer(input in_str)
138: val p = C.parse lexer (5,5)
139: in (close_in in_str; p)
140: end
141:
142: Grammar specifications
143: ----------------------
144:
145: The ML-Yacc specification is very similar to a Yacc specification. The
146: specification has the following form:
147:
148: {user routines}
149: %%
150: {declarations}
151: %%
152: {rules}
153:
154: The declarations section contains the following declarations:
155:
156: %term %eof %left %nonassoc %verbose %subst %default
157: %nonterm %start %right %structure %prefer %keyword
158:
159: User routines
160: -------------
161:
162: The user routines section must contain the following:
163:
164: type Lineno = ...
165: val lineno : Lineno ref = ...
166: val error : string -> Lineno -> unit
167: which is used to print error messages.
168:
169: The Lineno type is some representation of a position in the input file, so that
170: error messages can be keyed to the line number (or character position, etc.)
171: of the token that "caused" the error. If this is not necessary, Lineno can be
172: just "unit".
173:
174: %verbose
175: --------
176:
177: Generate a y.output file, which gives a verbose description of the
178: tables created from the grammar.
179:
180: %structure
181: ----------
182:
183: The name of the structure in which to place the parser should be specifed
184: using the statement %structure {structure name}
185:
186: Terminals and nonterminals
187: --------------------------
188:
189: All terminals must be declared in the %term statement. All
190: nonterminals must be declared in the %nonterm statement. The
191: statements both have a form similar to that of an ML constructor
192: statement.
193:
194: Each symbol may be followed by the phrase "of <type>".
195: Multiple symbols must be separated by a bar: '|'. At least one
196: symbol must appear in each statement.
197:
198: The user is cautioned not to use ML reserved words for terminal or
199: nonterminal names. The program produced by ML-Yacc will not load correctly.
200: The <type> may be any valid ML type expression.
201:
202: A terminal name for the eof terminal must also be supplied.
203:
204: The start nonterminal and the eof terminal
205: ------------------------------------------
206:
207: The eof terminal must be named in the %eof statement. Suppose the
208: eof terminal were EOF. Then the %eof statement would be
209:
210: %eof EOF
211:
212: The start terminal should be named in the %start statement. If one
213: is not supplied, the lhs of the first rule will be used.
214:
215: Precedence
216: ----------
217:
218: Precedence is specified in the same manner as yacc. The terminals
219: are listed after the %left, %right, or %nonassoc statement. The
220: statements are in order of ascending (tighter binding) precedence.
221:
222: Precedence operates like it does in yacc, except for %nonassoc.
223: Like YACC, each rule is assigned the precedence of its rightmost terminal.
224: In the case of a shift/reduce conflict the precedence of the terminal
225: in the shift and the precedence of the rule in the reduce are compared.
226: If the terminal has a higher precedence, the shift is performed. If the
227: rule has a higher precedence, the reduce is performed. If the terminal
228: and the rule have the same precedence, the associativity of the terminal
229: is used to resolve the conflict. If the terminal is left associative,
230: we reduce. If the terminal is right associative, we shift. If the
231: terminal is nonassociative we print an error message and shift.
232:
233: Thus %nonassoc does not produce a fatal error if the associativity of a
234: nonassociative terminal is used to resolve a conflict. A warning message is
235: printed, though, about this, and a shift is planted. The shift causes the
236: nonassociative terminal to default to right associativity.
237:
238: Reduce/reduce conflicts are fatal in ML-Yacc, unlike in yacc. There is
239: no resolution of reduce/reduce conflicts based on rule ordering.
240:
241: The %prec tag may be used to alter the precedence for a rule. It is
242: described below under the Rules section.
243:
244: Information for an error correction algorithm.
245: ----------------------------------------------
246:
247: The following keywords allow the user to specify information that may improve
248: recovery:
249:
250: %keyword - a list of tokens which are keywords
251: %prefer - a list of tokens which are preferred
252: for insertion
253: %subst - a list of preferred substitions for
254: certain tokens.
255: %default - value to be used for inserted tokens
256: that carry values. (not yet implemented)
257:
258: %keyword and %prefer should each be followed by a list of tokens.
259:
260: %subst has the following syntax:
261:
262: %subst {token} for {token} | ... | {token} for {token}
263:
264: Rules
265: -----
266:
267: The rule section consists of a list of rules.
268:
269: Each rule has the syntax:
270:
271: {lhs nonterminal} : {rhs symbol list} ({value of same type as nonterminal,
272: or any type if the nonterminal has
273: none })
274:
275: or
276:
277: {lhs nonterminal} : {rhs symbol list} %prec {terminal} ( ... value ...)
278:
279: The second form gives the rule the same precedence as the terminal.
280:
281: The | may be used to separate multiple rhs's with the same lhs.
282:
283: A null rhs may be specified by simply by having an empty rhs symbol list.
284:
285: The ( ... value ... ) part is not optional. It may be empty, though, if
286: the lhs nonterminal has no type associated with it.
287:
288: Values
289: ------
290:
291: The value for a symbol on the rhs of a rule is in a variable
292: {symbol}N. N is the number of occurrences of the symbol in the rhs,
293: up to and including the symbol. Suppose we had a specification of
294: the form:
295:
296: %term ... PLUS ...
297: %nonterm ... EXP of int ...
298: ...
299: %%
300:
301: EXP : EXP PLUS EXP
302:
303: Then the action could contain (EXP1 + EXP2). EXP1 is the value of the
304: first occurrence of EXP on the rhs, while EXP2 is the value of the second
305: occurrence on the rhs.
306:
307: If a symbol has no type associated with it, it has no value associated
308: with it. Attempting to use PLUS1, in the above example, would result
309: later in a compilation error.
310:
311: Any value returned by a rule whose left hand side has no type associated with
312: it is ignored. The ML code associated with such rules may return any kind of
313: value; it will be executed for possible side-effects.
314:
315: If a terminal or nonterminal occurs only once on the rhs of a rule, its
316: value is also in {symbol}, as well as in {symbol}1
317:
318: Bugs
319: ----
320: There should be a better way for semantic-action routines to print out errors
321: and get the appropriate range of line numbers in the message automatically.
322:
323: Functors should be used to make the lexer more independent of the parser.
324: Speed should be improved in future versions.
325:
326: Sample specification
327: --------------------
328: (* sample.grm *)
329: type Lineno = int
330: val lineno = ref 1
331: fun error s l = output std_out
332: ("line " ^ makestring (l:int) ^ ":" ^ s ^ "\n")
333: fun lookup s = ordof(s,0) - ord("a")+1
334: %%
335: %structure Calc
336: %eof EOF
337: %start START
338:
339: %left SUB PLUS
340: %left TIMES DIV
341: %right CARAT
342:
343: %term ID of string | NUM of int | PLUS | TIMES | PRINT | EOS | EOF |
344: CARAT | DIV | SUB
345: %nonterm EXP of int | START | STMT | STMT_LIST
346:
347: %%
348: START : STMT_LIST ()
349: STMT_LIST : STMT_LIST STMT EOS ()
350: STMT_LIST : ()
351: STMT : PRINT EXP (print EXP; print "\n"; flush_out std_out)
352: STMT : EXP ()
353: EXP : NUM (NUM)
354: | ID (lookup ID)
355: | EXP PLUS EXP (EXP1+EXP2)
356: | EXP TIMES EXP (EXP1*EXP2)
357: | EXP DIV EXP (EXP1 div EXP2)
358: | EXP SUB EXP (EXP1-EXP2)
359: | EXP CARAT EXP (let fun e (m,0) = 1
360: | e (m,i) = m*e(m,i-1)
361: in e (EXP1,EXP2)
362: end)
363:
364: Sample ML-Lex specification
365: ---------------------------
366: (* sample.lex *)
367: open Calc.LexValue
368: type lexresult=V
369: val eof = fn () => EOF
370: %%
371: alpha=[A-Za-z];
372: digit=[0-9];
373: ws = [\ \t\n];
374: %%
375: {ws}+ => (lex());
376: {digit}+ => (NUM (revfold (fn (a,r) => ord(a)-ord("0")+10*r) (explode yytext) 0));
377: "+" => (PLUS);
378: "*" => (TIMES);
379: ";" => (EOS);
380: {alpha}+ => (if yytext="print" then PRINT else ID yytext);
381: "-" => (SUB);
382: "^" => (CARAT);
383: "/" => (DIV);
384: . => (Calc.HDR.error ("ignoring bad character " ^ yytext); lex());
385:
386:
387: Sample "main" module
388: --------------------
389: (* sample.sml *)
390: structure Sample =
391: struct
392: fun run filename =
393: (* more suitable for non-interactive use *)
394: let val in_str = open_in filename
395: val lexer = Mlex.makeLexer (input in_str)
396: val p = (Calc.HDR.lineno := 0;
397: Calc.parse lexer (5,5))
398: in (close_in in_str; p)
399: end
400:
401: fun run_std_in () =
402: (* more suitable for interactive use *)
403: let val lexer = Mlex.makeLexer (fn _ => input_line std_in)
404: val p = (Calc.HDR.lineno := 0;
405: Calc.parse lexer (0,0))
406: in p
407: end
408: end
409:
410: Sample input
411: ------------
412: (* sample.input, contains an intentional syntax error *)
413: print 4+2;
414: a+b
415: print b*c;
416:
417:
418: How to try the sample program
419: ----------------------------
420:
421: % sml
422: - use "mlyacc.sml"; (* load the parser generator *)
423: - use "lexgen.sml"; (* load the lexical analyzer generator *)
424: - open LexGen ParseGen;
425: - exportML "yacclex"; (* save the image for later use *)
426:
427: - lexGen "sample.lex"; (* creates the file sample.lex.sml *)
428: - parseGen "sample.grm"; (* creates the file sample.grm.sml *)
429: - ^D (* exiting here is optional, of course *)
430:
431: % sml
432: - use "sample.grm.sml"; (* compile the parser *)
433: - use "sample.lex.sml"; (* compile the lexer *)
434: - use "sample.sml"; (* compile the main module *)
435: - Sample.run "sample.input"; (* run the sample program *)
436:
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.