|
|
1.1 root 1: .\" @(#)ss4 6.1 (Berkeley) 5/8/86
2: .\"
3: .SH
4: 4: How the Parser Works
5: .PP
6: Yacc turns the specification file into a C program, which
7: parses the input according to the specification given.
8: The algorithm used to go from the
9: specification to the parser is complex, and will not be discussed
10: here (see
11: the references for more information).
12: The parser itself, however, is relatively simple,
13: and understanding how it works, while
14: not strictly necessary, will nevertheless make
15: treatment of error recovery and ambiguities much more
16: comprehensible.
17: .PP
18: The parser produced by Yacc consists
19: of a finite state machine with a stack.
20: The parser is also capable of reading and remembering the next
21: input token (called the
22: .I lookahead
23: token).
24: The
25: .I "current state"
26: is always the one on the top of the stack.
27: The states of the finite state machine are given
28: small integer labels; initially, the machine is in state 0,
29: the stack contains only state 0, and no lookahead token has been read.
30: .PP
31: The machine has only four actions available to it, called
32: .I shift ,
33: .I reduce ,
34: .I accept ,
35: and
36: .I error .
37: A move of the parser is done as follows:
38: .IP 1.
39: Based on its current state, the parser decides
40: whether it needs a lookahead token to decide
41: what action should be done; if it needs one, and does
42: not have one, it calls
43: .I yylex
44: to obtain the next token.
45: .IP 2.
46: Using the current state, and the lookahead token
47: if needed, the parser decides on its next action, and
48: carries it out.
49: This may result in states being pushed onto the stack, or popped off of
50: the stack, and in the lookahead token being processed
51: or left alone.
52: .PP
53: The
54: .I shift
55: action is the most common action the parser takes.
56: Whenever a shift action is taken, there is always
57: a lookahead token.
58: For example, in state 56 there may be
59: an action:
60: .DS
61: IF shift 34
62: .DE
63: which says, in state 56, if the lookahead token is IF,
64: the current state (56) is pushed down on the stack,
65: and state 34 becomes the current state (on the
66: top of the stack).
67: The lookahead token is cleared.
68: .PP
69: The
70: .I reduce
71: action keeps the stack from growing without
72: bounds.
73: Reduce actions are appropriate when the parser has seen
74: the right hand side of a grammar rule,
75: and is prepared to announce that it has seen
76: an instance of the rule, replacing the right hand side
77: by the left hand side.
78: It may be necessary to consult the lookahead token
79: to decide whether to reduce, but
80: usually it is not; in fact, the default
81: action (represented by a ``.'') is often a reduce action.
82: .PP
83: Reduce actions are associated with individual grammar rules.
84: Grammar rules are also given small integer
85: numbers, leading to some confusion.
86: The action
87: .DS
88: \fB.\fR reduce 18
89: .DE
90: refers to
91: .I "grammar rule"
92: 18, while the action
93: .DS
94: IF shift 34
95: .DE
96: refers to
97: .I state
98: 34.
99: .PP
100: Suppose the rule being reduced is
101: .DS
102: A \fB:\fR x y z ;
103: .DE
104: The reduce action depends on the
105: left hand symbol (A in this case), and the number of
106: symbols on the right hand side (three in this case).
107: To reduce, first pop off the top three states
108: from the stack
109: (In general, the number of states popped equals the number of symbols on the
110: right side of the rule).
111: In effect, these states were the ones
112: put on the stack while recognizing
113: .I x ,
114: .I y ,
115: and
116: .I z ,
117: and no longer serve any useful purpose.
118: After popping these states, a state is uncovered
119: which was the state the parser was in before beginning to
120: process the rule.
121: Using this uncovered state, and the symbol
122: on the left side of the rule, perform what is in
123: effect a shift of A.
124: A new state is obtained, pushed onto the stack, and parsing continues.
125: There are significant differences between the processing of
126: the left hand symbol and an ordinary shift of a token,
127: however, so this action is called a
128: .I goto
129: action.
130: In particular, the lookahead token is cleared by a shift, and
131: is not affected by a goto.
132: In any case, the uncovered state contains an entry such as:
133: .DS
134: A goto 20
135: .DE
136: causing state 20 to be pushed
137: onto the stack, and become the current state.
138: .PP
139: In effect, the reduce action ``turns back the clock'' in the parse,
140: popping the states off the stack to go back to the
141: state where the right hand side of the rule was first seen.
142: The parser then behaves as if it had seen the left side at that time.
143: If the right hand side of the rule is empty,
144: no states are popped off of the stack: the uncovered state
145: is in fact the current state.
146: .PP
147: The reduce action is also important in the treatment of user-supplied
148: actions and values.
149: When a rule is reduced, the code supplied with the rule is executed
150: before the stack is adjusted.
151: In addition to the stack holding the states, another stack,
152: running in parallel with it, holds the values returned
153: from the lexical analyzer and the actions.
154: When a shift takes place, the external variable
155: .I yylval
156: is copied onto the value stack.
157: After the return from the user code, the reduction is carried out.
158: When the
159: .I goto
160: action is done, the external variable
161: .I yyval
162: is copied onto the value stack.
163: The pseudo-variables $1, $2, etc., refer to the value stack.
164: .PP
165: The other two parser actions are conceptually much simpler.
166: The
167: .I accept
168: action indicates that the entire input has been seen and
169: that it matches the specification.
170: This action appears only when the lookahead token is
171: the endmarker, and indicates that the parser has successfully
172: done its job.
173: The
174: .I error
175: action, on the other hand, represents a place where the parser
176: can no longer continue parsing according to the specification.
177: The input tokens it has seen, together with the lookahead token,
178: cannot be followed by anything that would result
179: in a legal input.
180: The parser reports an error, and attempts to recover the situation and
181: resume parsing: the error recovery (as opposed to the detection of error)
182: will be covered in Section 7.
183: .PP
184: It is time for an example!
185: Consider the specification
186: .DS
187: %token DING DONG DELL
188: %%
189: rhyme : sound place
190: ;
191: sound : DING DONG
192: ;
193: place : DELL
194: ;
195: .DE
196: .PP
197: When Yacc is invoked with the
198: .B \-v
199: option, a file called
200: .I y.output
201: is produced, with a human-readable description of the parser.
202: The
203: .I y.output
204: file corresponding to the above grammar (with some statistics
205: stripped off the end) is:
206: .DS
207: state 0
208: $accept : \_rhyme $end
209:
210: DING shift 3
211: . error
212:
213: rhyme goto 1
214: sound goto 2
215:
216: state 1
217: $accept : rhyme\_$end
218:
219: $end accept
220: . error
221:
222: state 2
223: rhyme : sound\_place
224:
225: DELL shift 5
226: . error
227:
228: place goto 4
229:
230: state 3
231: sound : DING\_DONG
232:
233: DONG shift 6
234: . error
235:
236: state 4
237: rhyme : sound place\_ (1)
238:
239: . reduce 1
240:
241: state 5
242: place : DELL\_ (3)
243:
244: . reduce 3
245:
246: state 6
247: sound : DING DONG\_ (2)
248:
249: . reduce 2
250: .DE
251: Notice that, in addition to the actions for each state, there is a
252: description of the parsing rules being processed in each
253: state. The \_ character is used to indicate
254: what has been seen, and what is yet to come, in each rule.
255: Suppose the input is
256: .DS
257: DING DONG DELL
258: .DE
259: It is instructive to follow the steps of the parser while
260: processing this input.
261: .PP
262: Initially, the current state is state 0.
263: The parser needs to refer to the input in order to decide
264: between the actions available in state 0, so
265: the first token,
266: .I DING ,
267: is read, becoming the lookahead token.
268: The action in state 0 on
269: .I DING
270: is
271: is ``shift 3'', so state 3 is pushed onto the stack,
272: and the lookahead token is cleared.
273: State 3 becomes the current state.
274: The next token,
275: .I DONG ,
276: is read, becoming the lookahead token.
277: The action in state 3 on the token
278: .I DONG
279: is ``shift 6'',
280: so state 6 is pushed onto the stack, and the lookahead is cleared.
281: The stack now contains 0, 3, and 6.
282: In state 6, without even consulting the lookahead,
283: the parser reduces by rule 2.
284: .DS
285: sound : DING DONG
286: .DE
287: This rule has two symbols on the right hand side, so
288: two states, 6 and 3, are popped off of the stack, uncovering state 0.
289: Consulting the description of state 0, looking for a goto on
290: .I sound ,
291: .DS
292: sound goto 2
293: .DE
294: is obtained; thus state 2 is pushed onto the stack,
295: becoming the current state.
296: .PP
297: In state 2, the next token,
298: .I DELL ,
299: must be read.
300: The action is ``shift 5'', so state 5 is pushed onto the stack, which now has
301: 0, 2, and 5 on it, and the lookahead token is cleared.
302: In state 5, the only action is to reduce by rule 3.
303: This has one symbol on the right hand side, so one state, 5,
304: is popped off, and state 2 is uncovered.
305: The goto in state 2 on
306: .I place ,
307: the left side of rule 3,
308: is state 4.
309: Now, the stack contains 0, 2, and 4.
310: In state 4, the only action is to reduce by rule 1.
311: There are two symbols on the right, so the top two states are popped off,
312: uncovering state 0 again.
313: In state 0, there is a goto on
314: .I rhyme
315: causing the parser to enter state 1.
316: In state 1, the input is read; the endmarker is obtained,
317: indicated by ``$end'' in the
318: .I y.output
319: file.
320: The action in state 1 when the endmarker is seen is to accept,
321: successfully ending the parse.
322: .PP
323: The reader is urged to consider how the parser works
324: when confronted with such incorrect strings as
325: .I "DING DONG DONG" ,
326: .I "DING DONG" ,
327: .I "DING DONG DELL DELL" ,
328: etc.
329: A few minutes spend with this and other simple examples will
330: probably be repaid when problems arise in more complicated contexts.
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.