|
|
1.1 root 1: \documentstyle[titlepage,fullpage,12pt,clbiba]{article}
2: \begin{document}
3: \newcommand{\twigcomp}{{\tt twig}}
4: \newcommand{\Twigcomp}{{\tt Twig}}
5: \newcommand{\twiglang}{{\it twig}}
6: \newcommand{\Twiglang}{{\it Twig}}
7: \title{Twig Reference Manual}
8:
9: \date{
10: \begin{minipage}{.8\textwidth}
11: \normalsize
12: \noindent
13: \Twiglang{} is a language for manipulating trees. A \twiglang{}
14: program consists of a set
15: of pattern-action rules together with associated declarations. Patterns
16: describe trees to be matched. Actions calculate costs, perform tree
17: manipulations and other functions such as emitting code. A \twiglang{} program is
18: translated by the \twigcomp{} compiler into subroutines and tables in a host
19: language. In the current implementation, the host language is C.\\[6pt]
20: A \twiglang{} program manipulates trees by first finding a minimum cost covering of
21: the input tree. The actions of the rules whose pattern parts composes the
22: covering is then executed. The minimum cost covering is determined using
23: dynamic programming. This technique naturally resolves many ambiguities that
24: may be in the specifications.\\[6pt]
25: The prime purpose of \twiglang{} is to create tree manipulation programs.
26: One interesting application of tree manipulation is code generation and
27: \twiglang{} has been used to implement a code generator for the
28: pcc2 compiler on the VAX.
29: \end{minipage}}
30:
31: \author{Steven W.K. Tjiang}
32:
33: \maketitle
34: \tableofcontents\newpage
35: \section{Introduction}
36:
37: \Twiglang{} is a language for writing tree manipulation programs.
38: A \twiglang{} program does this by using a matcher to find a minimal
39: cost covering of a tree.
40: The covering is composed of
41: user supplied templates.
42: Trees are manipulated by
43: executing the actions associated with the templates of their coverings.
44: These actions may rewrite the tree, update other data structures, or
45: generate output.
46:
47: Figure~\ref{overview} gives an overview of the \twiglang{} system.
48: The round boxes in the figure indicates files and the sharp cornered boxes
49: are processing programs.
50: The user writes the \twiglang{} specification file.
51: This file is then compiled by the \twigcomp{} compiler into
52: host language files.
53: In the current version of \twiglang{}, the host language is C.
54: The host language files are then compiled and linked with
55: other user supplied files.
56: The resulting program represented by the large box at the bottom
57: of figure~\ref{overview} is the tree manipulation program.
58: \begin{figure}[hp]
59: \input{pic}
60: \caption{The {\bf Twig } system}
61: \label{overview}
62: \end{figure}
63: The details of this box will be discussed later.
64:
65: \section{Motivation}
66:
67: Although \Twiglang{} is a language for manipulating trees
68: it was originally motivated by
69: pattern directed code generation. In many compilers, the typical
70: structure is a {\sl front end} and a {\sl back end}. The front end reads
71: the input source program, checks its syntax and contextual conditions,
72: and builds an intermediate representation of the program.
73: The language used to do this representation is often called the
74: {\it intermediate representation} and will be abbreviated as IR.
75: The back end
76: takes the IR and translates it into the target machine code.
77:
78: That isn't the whole story but it's close. Both ends may be
79: sets of programs rather than one
80: program. There may also be a
81: {\it middle} that performs transformations on the
82: IR. The intention of these
83: transformations is to improve the target code that comes out of the compiler.
84: For this discussion, we can ignore the middle.
85:
86: IRs can vary greatly from compiler to compiler.
87: However the IRs are implemented, they are encodings of graphs.
88: Usually three types of graphs are evident\cite{dragon}.
89: There is a flowgraph whose nodes are {\sl basic blocks} and
90: the edges corresponds to transfer of control between the basic blocks.
91: Basic blocks are not featureless. They are graphs themselves.
92: Some compilers will represent them as a forest of
93: {\it directed acyclic graphs} or DAGs.
94: The nodes are data values and operations.
95: Since operations yields values, one can think of nodes as
96: corresponding to values. An edge exists from node $a$ to
97: node $b$ if and only if the value corresponding to $a$ depends on the
98: value corresponding to $b$. Other compilers will use trees instead of DAGs.
99: Trees are not
100: as general as DAGs and these compilers overcome that by using some
101: type of labelling or copying.
102:
103: Writing a front end usually
104: involves writing a recognizer
105: for the source language, designing a symbol table,
106: and determining an error recovery strategy. What has made this process easy
107: are parser generators like YACC\cite{yacc}.
108: They separate out the task of recognizing the
109: source language from the other details. This makes the task
110: intellectually more manageable.
111: The compiler implementor now specifies the source language via grammar
112: productions. Symbol table manipulations,
113: and IR construction rules are attached to productions and activated
114: when a source language construct is recognized.
115: The front end is then {\it automatically} generated by processing the
116: specification with a parser generator.
117: The result is
118: a high quality front end whose performance is close to that of a
119: hand-crafted front end.
120:
121: Providing the same type of facilities for automatically building back
122: ends has been much more difficult. Extending current front end
123: parsers to do back ends
124: has been workable but not entirely successful. The basic idea is that
125: the back end implementor writes a {\it machine grammar}
126: for the IR and code generation
127: occurs as actions that are executed when patterns are found
128: in the IR\cite{gana,henry}.
129: Unfortunately, these grammars are large and ambiguous. Using
130: parsers to handle them have usually met with many combinatorial problems
131: \cite{henry}. Furthermore, these code generators are slow
132: although their output is often of high quality.
133: The software engineering advantages one see with parsers in the front end
134: are not so evident in this circumstance. Instead of having to deal with
135: the complexities of code generation, the implementor now has to deal with
136: the complexities of the very large machine grammar.
137:
138: Another approach at automatically building back ends is to start with
139: known good algorithms for generating code. From these algorithms, we
140: can build an abstract interpreter and then specifications can be written
141: to drive them. We hope that by doing this we can derive a much more natural
142: code generation specification technique.
143: \Twiglang{} is one possible result of this approach. The algorithms
144: that form the basis of \twiglang{} are presented in \cite{optimalcode} and
145: \cite{treematch}. \Twiglang{}
146: specifications are much less complex than parser based
147: specifications. Ambiguities are handled naturally and do not complicate
148: the specifications.
149:
150: In \cite{optimalcode}, it is shown how optimal code can be generated
151: for an expression tree
152: given a description of a machine's instruction set.
153: This description is in term of tree templates.
154: The algorithm first determines a minimal cost covering of
155: an expression tree using dynamic programming.
156: The covering is composed from the templates in the
157: description. The code generated is then the instructions
158: represented by the templates in the covering.
159: This algorithm was shown to generate the shortest code sequences for an
160: abstract machine and
161: its running time was shown to be linear in the number of nodes in the
162: expression tree in \cite{optimalcode}.
163:
164: This is all very well in theory but what about in practice.
165: Code length is not always the quantity that one wishes to minimize.
166: One might wish to minimize the number of memory accesses or machine cycles
167: executed by the generated code.
168: Fortunately, the algorithm can also minimize these quantities.
169: In fact, it can minimize any strictly increasing cost function.
170: Let $I$ denote a machine instruction and
171: $I_0I_1I_2...I_n$ denote a sequence of instructions.
172: Then a cost function $f$ is strictly increasing if and only if
173: $f(I_0I_1I_2...I_n) < f(I_0I_1I_2...I_nI_{n+1})$ for all
174: instruction sequences $I_0I_1I_2...I_n$ and instruction
175: $I_{n+1}$. Another shortcoming in \cite{optimalcode}
176: is that no efficient algorithm is given for matching the templates
177: in the tree. This is where \cite{treematch} comes in. In that paper,
178: two algorithms are presented for tree matching. \Twiglang{} combines
179: the work presented in the two papers.
180:
181: \Twiglang{} differs from the algorithm as presented in \cite{optimalcode} in
182: the following ways:
183: \begin{itemize}
184: \item
185: It provides a convenient specification language for the templates.
186: \item
187: Tree templates in \cite{optimalcode} were labelled with either {\sl Register} or
188: {\sl Memory}. This label being the storage class that the instruction corresponding
189: to the template would put a result.
190: \Twiglang{} generalize tree labels so that
191: they can be any symbol.
192: \item
193: Instead of associating an instruction with each template, \twiglang{} associates
194: an action.
195: \item
196: In \cite{optimalcode}, code is generated by a bottom up traversal of the
197: minimum cost covering. \Twiglang{} allows a top down traversal of the cover
198: and also the ability for the actions to rewrite parts of the
199: tree.
200: \item
201: The algorithm in \cite{optimalcode} determines the evaluation order
202: which gives the optimal code.
203: \Twiglang{} does not do this.
204: The evaluation order determination was omitted based on the conjecture that
205: most of the time, simple left to right evaluation of the leaves will yield
206: near optimal results.
207: In situations where a nonstandard evaluation order is required, a new
208: \twiglang{} pattern could be added that codes the order explicitly
209: (see Section~\ref{exec:phase}).
210: \end{itemize}
211:
212: In this manual, a brief discussion will be given of the \twiglang{} language
213: and how \twiglang{} programs can be incorporated with C.
214:
215: Many statements have been made in this manual without proof.
216: If the reader wishes to pursue it further, more details about the algorithms
217: will be presented in an upcoming paper,
218: \cite{aho:gana:tjiang}.
219:
220: \section{The \Twiglang{} Language}
221: \subsection{Lexical Issues and Conventions}
222: Currently the
223: following identifiers are reserved keywords of \twiglang{}:
224:
225: \begin{tabbing}
226: \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\kill
227: action \> cost \> insert \\
228: label \> node \> prologue
229: \end{tabbing}
230:
231: \noindent ;:(),= are special characters. All blanks, tabs, formfeeds and
232: newlines are ignored by \twiglang{} but they may not appear inside
233: identifiers and numbers. Identifiers are nonempty strings
234: made of letters, digits or underscores and starting with a letter.
235: Numbers are nonempty string of digits. Code fragments or action parts
236: are enclosed in braces. Inside code fragments, C lexical rules must
237: be obeyed except that strings of the form $\$...\$$ that are not inside C
238: strings have special meaning to \twiglang{}. In the following sections, $id$
239: denotes an identifier; $int_1$, $int_2$ denotes numbers; $ccode$
240: denotes C code fragments; $...$ indicates repetition of the previous
241: item; $[...]$ indicates that $...$ is optional.
242:
243: The input to the \twiglang{} program will be referred to as the subject tree.
244:
245: \subsection{Rules}
246:
247: The fundamental unit of a \twiglang{} program is a rule.
248:
249: \begin{verse}
250: $label\_id$ {\bf :} $tree\_pattern$\quad {\bf [} $cost$ {\bf ]}
251: {\bf [ =} $action$ {\bf ]}
252: \end{verse}
253:
254: \noindent Intuitively, the pattern is used to match a subtree.
255: The $cost$ code
256: fragment is then evaluated. The resulting cost is recorded by
257: the matcher for use in dynamic programming. The $action$ is executed
258: if the rule is part of the least cost cover of the tree.
259: The details of pattern matching will be discussed in Section~\ref{pattern}.
260:
261: If the $cost$ part is missing, \twigcomp{} will insert default code
262: that returns the special value, {\tt DEFAULT\_COST}.
263: A missing $action$ part indicates that nothing will be done when a
264: match is found.
265:
266: \subsection{Tree Patterns}
267: \label{TreePatterns}
268: Tree patterns are specified
269: in prefix parenthesized form and can be described by the
270: following BNF:
271:
272: \begin{verse}
273: $tree\_pattern\to node\_id\;\vert\;label\_id$\\
274: $tree\_pattern\to node\_id \;{\bf (}\; tree\_list \;{\bf )}$\\
275: $tree\_list \to tree \;{\bf ,}\; tree\_list \;\vert\; tree$\\
276: \end{verse}
277:
278: \begin{figure}
279: \unitlength 0.1in
280: \begin{center}
281: \begin{picture}(25,16)
282: \put(7,0){\framebox(4,3){\tt id}}
283: \put(20,0){\framebox(5,3){\tt const}}
284: \put(13,6){\framebox(5,3){\tt plus}}
285: \put(6,12){\framebox(5,3){\tt plus}}
286: \put(0,6){\framebox(4,3){\tt id}}
287: \put(11,3){\line(2,3){2}}
288: \put(20,3){\line(-2,3){2}}
289: \put(13,9){\line(-2,3){2}}
290: \put(4,9){\line(2,3){2}}
291: \end{picture}
292: \end{center}
293: \vspace{0.5in}
294: \begin{verse}
295: \begin{center}
296: \tt plus(id, plus(id, const))
297: \end{center}
298: \end{verse}
299: \caption{A Tree and its prefixed form}
300: \label{treepattern}
301: \end{figure}
302: \noindent Figure~\ref{treepattern} gives an example of a tree pattern
303: and its prefix parenthesized form.
304:
305: There are two types of symbols: $node\_id$\/s and $label\_id$\/s.
306: $Node\_id$\/s are used to denote internal nodes and leaves.
307: $Label\_id$\/s label tree patterns and are analogous to
308: nonterminals in context free grammars.
309: For example, the following \twiglang{} rules without their action parts describe
310: simple expression trees with the {\tt plus} operator.
311:
312: \begin{verse}
313: {\tt\begin{tabbing}
314: expr: \= plus(expr, expr)\\
315: expr: \> identifier\\
316: expr: \> constant
317: \end{tabbing}}
318: \end{verse}
319:
320: \noindent Here, {\tt identifier} and {\tt constant} are $node\_id$\/s
321: representing leaves, and
322: {\tt plus} is a $node\_id$ representing an internal node whereas
323: {\tt expr} is a
324: $label\_id$. Pattern leaves that are $label\_id$\/s are called
325: {\sl labelled leaves}.
326: In the first rule, both leaves of the pattern are labelled.
327:
328: \Twigcomp{} associates an integer with each node symbol and label symbol. These
329: integers are used by the \twiglang{} pattern matcher as encodings for the
330: symbols.
331: As the matcher traverses the
332: tree, a user supplied subroutine is called to provide an integer
333: corresponding to each node.
334:
335: \subsection{Costs}
336:
337: To increase the flexibility of representing costs, the tree matcher
338: views costs as an {\sl abstract data type} or ADT.
339: For example, costs may be represented as an integer or
340: as a vector of integers with each element representing the cost of a specific
341: resource.
342: A cost ADT suitable for \twiglang{} must have the following four definitions:
343:
344: \begin{itemize}
345: \item {\tt COST} is a C type.
346: Although the proper functioning of the tree matcher
347: does not depend on the internal details of the {\tt COST} type, it must have
348: the type information for storage allocation purposes.
349: \item {\tt INFINITY} is the maximum attainable cost value.
350: The matcher uses {\tt INFINITY} to initialize its data structures.
351: \item {\tt DEFAULT\_COST} is the cost
352: value returned by rules without a cost part.
353: \item
354: {\tt COSTLESS} is a function of
355: two cost values. It must return true if and only if the first argument is
356: less than the second.
357: \end{itemize}
358:
359: \subsection{Trees}
360:
361: As with costs, trees are treated as an ADT. Here, using an ADT is
362: even more important because trees come in a variety of shapes and
363: representations.
364: \Twiglang{} would be overly complicated if it had to
365: know any more than the fundamental properties of trees. Thus,
366: \twiglang{} treats trees as an acyclic directed
367: graph of almost featureless nodes with one
368: distinguished node, the root. Each node has only one feature and that
369: is an integer corresponding to the $node\_id$\/s discussed above.
370: To provide this
371: view to \twiglang{}, the user must provide the following definitions and
372: subroutines.
373:
374: \begin{itemize}
375: \item {\tt NODEPTR} is the type of a node. The actual
376: details of
377: the {\tt NODEPTR}
378: are irrelevant and \twigcomp{} uses this definition only for storage
379: allocation and declaration purposes.
380: \item
381: {\tt NULL} is used to denote a null {\tt NODEPTR} value.
382: In the current implementation this is the same {\tt NULL} used by the
383: C standard I/O package and need not be defined by the user.
384: This restricts the {\tt NODEPTR} representation to be a pointer.
385: \item
386: {\tt mtGetNodes$(r,n)$} returns the $n$th child of $r$ where $r$
387: is a {\tt NODEPTR}. The
388: leftmost child is {\tt mtGetNodes}$(r,1)$. If $n > \hbox{degree}(r)$
389: then the function should return {\tt NULL}. \Twigcomp{} expects the
390: expression {\tt mtGetNodes}$({\tt NULL}, 0)$
391: to denote the root node of
392: the subject tree.
393: \item
394: {\tt mtValue$(r)$} returns the integer associated with $r$
395: (See section~\ref{TreePatterns}).
396: \item
397: {\tt mtSetNodes$(r,n,c)$} replaces the $n$th child of $r$ with
398: $c$. This routine may be used to replace whole subtrees with others.
399: {\tt mtSetNodes}$({\tt NULL},0,c)$ is used by \twigcomp{} to set the
400: root of the subject tree to $c$.
401: \end{itemize}
402:
403: \subsection{Prologue and Inserts}
404:
405: \begin{verse}
406: {\tt prologue} $ccode$;
407: \end{verse}
408:
409: \noindent signals to \twigcomp{} that $ccode$ should be inserted at the
410: beginning of the output source file. $Ccode$ should contain
411: definitions relevant to the C code in
412: rules that are defined later in the \twiglang{} source file.
413:
414: \begin{verse}
415: {\tt insert} $ccode$;
416: \end{verse}
417:
418: \noindent causes $ccode$ to be placed into the source file.
419: There can be multiple inserts and they will be placed into the source
420: file
421: in the order that they appear.
422:
423: \subsection{Declarations}
424:
425: All \twiglang{} identifiers are declared before they are used.
426:
427: \begin{verse}
428: {\tt node $id[(int_1)] [= int_2] ...$;\hfil}
429: \end{verse}
430:
431: \noindent A node declaration declares one or more identifier to be
432: associated with nodes of the subject tree. The identifiers are
433: assigned numbers by \twigcomp{} but the user can overide the assigned number
434: by specifying $int_2$. The degree of the node identifer, the
435: number of children, is assumed to be fixed. Normally, \twigcomp{} will
436: deduce the degree when $id$ is used in a rule. However, the user may
437: explicitly give the degree by specifying $int_1$.
438:
439: \begin{verse}
440: {\tt label $id ...$;}
441: \end{verse}
442:
443: A label declaration indicates that the following $id$'s are to be used
444: as labels of rules.
445:
446: \subsection{Costs and Action code}
447:
448: Code fragments such as the $cost$ and $action$ clauses of a rule are
449: specfied by enclosing C code in braces. All legal C constructs are
450: permitted in code fragments. In addition, the following are provided
451: for convenient access to the subject tree and internal data
452: structures of the matcher.
453:
454: \begin{itemize}
455: \item $\$\%n\$$ denotes a pointer value to the matcher data
456: structure for the $n$th labelled leaf. Thus to access the cost value
457: associated with that leaf, the notation $\$\%n\$\to${\tt cost} may be used.
458:
459: \item $\$\$$ denotes a pointer value to the root of the subject
460: tree.
461:
462: \item $\$n_1.n_2.n_3.\, \hbox{...}\,.n_{k-1}.n_k\$$ denotes
463: a pointer value to the $n_k$th child of the $n_{k-1}th$ child of the
464: $n_{k-2}$th child of \hbox{...} of the $n_1$th child of the root of
465: the subject tree. Each $n_i$ is a positive non-zero integer.
466: \end{itemize}
467:
468: Some special constructs are available to code fragments in the cost
469: part of the specification. The
470: statement ``{\tt ABORT;}'' when encountered during the execution of
471: the cost code, signals to the matcher that this pattern is to be
472: rejected. When a ``{\tt REWRITE;}'' statement is encountered, control
473: is returned to the matcher and the rule will become a {\sl rewrite}
474: mode match. When the end of the cost code fragment is reached,
475: control is returned to the matcher and the rule becomes a {\sl defer}
476: mode match.
477: The statement ``{\tt TOPDOWN;}'' is like ``{\tt REWRITE;}'' except that
478: the mode of the match becomes {\sl topdown}. The meanings of these modes
479: will be explained in Section~\ref{pattern}.
480:
481: Cost values are returned to the matcher by assigning to
482: the ``{\tt cost}'' variable in the cost clause. If no assignment is
483: made to the {\tt cost} variable then the returned cost will be {\tt
484: DEFAULT\_COST}.
485:
486: Action clauses are expected to return a new tree which will replace
487: the subject tree. This is done by returning using the
488: ``{\tt return($new\_tree$);}'' statement.
489: where {\tt $new\_tree$} is of type
490: {\tt NODEPTR}.
491: If execution reaches the end of the action clause, the
492: matcher resumes execution and the subject tree is not modified.
493:
494: \section{Pattern Matcher Operation}
495: \label{pattern}
496:
497: The
498: pattern matcher operates in two phases: the costing phase and the execution
499: phase.
500: During
501: the costing phase, a minimal cost covering of the subject tree is found.
502: The execution phase invokes the actions that are associated with the patterns
503: making up the covering. Execution phases may begin during the costing phase
504: to execute the covering of a subtree as described in Section~\ref{exec:phase}.
505:
506: \subsection{The Costing Phase}
507: \label{cost:phase}
508:
509: Let the patterns of the \twiglang{} specifications be interpreted as grammar
510: productions with {\sl label\_id}s as non-terminals and {\sl node\_id}s
511: as terminals.
512: If we add a production
513: $S\,\to\,\hbox{\sl label\_id}$ for all {\sl label\_id}s where $S$ is the
514: start symbol,
515: then a covering for a tree is analogous to a derivation
516: for the prefixed form of the tree.
517: Figure~\ref{gram:interp} shows some patterns and its corresponding grammar.
518: \begin{figure}
519: \begin{center}
520: \begin{verse}
521: \tt
522: e: plus(e,e)\\
523: e: plus(e, plus(e,e))\\
524: e: id\\
525: e: const\\
526: \end{verse}
527: \end{center}
528: \vspace{0.3in}
529: \begin{center}\begin{verse}
530: $S\to e$\\
531: $e\to plus(e,e)$\\
532: $e\to plus(e,plus(e,e)$\\
533: $e\to id$\\
534: $e\to const$\\
535: \end{verse}\end{center}
536: \caption{Pattern and Grammar}
537: \label{gram:interp}
538: \end{figure}
539: The productions of the form $S\to\hbox{\sl label\_id}$ reflect the fact that
540: any {\sl label\_id} may start a cover.
541:
542: The matcher finds the least cost cover by doing a preorder traversal of
543: the subject tree.
544: At the same time, it
545: builds a skeleton tree that is structurally isomorphic to the subject tree.
546: When a match is discovered the cost clause of the pattern is invoked to
547: calculate the cost. Many patterns with different labels could match at any
548: given node but only the lowest cost pattern for each label is recorded in
549: the skeleton.
550:
551: When a pattern is matched, its label is then used as input to the pattern
552: matcher so that matching of patterns with labelled leaves can begin.
553: This process
554: is analogous to a reduce action in
555: bottom up parsers.
556:
557: \subsection{The Execution Phase}
558: \label{exec:phase}
559:
560: The execution phase starts when the costing phase is complete or when
561: a sufficiently low cost {\sl rewrite} mode rule is encountered.
562: Let $M$ be a matching
563: rule and $L_1, L_2, L_3, ... ,L_n$ be
564: the labelled leaves of the matching pattern.
565: The following procedure is used to execute the action parts of $M$.
566:
567: \vbox{
568: \vspace{2ex}
569: \noindent {\bf Procedure \it Execute}
570:
571: \begin{itemize}
572: \item If $M$ is a {\sl deferred} mode match then
573: execution occurs by first applying {\it Execute} to the $L_i$s
574: from left to right. That is, in the order that they appear in
575: the prefix form of the tree.
576: When all the $L_i$s are executed, the action part of $M$ is
577: invoked.
578:
579: \item If $M$ is a {\sl rewrite } mode match then
580: just the action part of $M$ is executed. When the action part returns
581: the matcher deletes the skeleton corresponding to the subtree that
582: $M$ matched and rescans the new subtree that may have been substituted
583: by executing $M$.
584:
585: \item If $M$ is a {\sl topdown} mode match then only
586: the action part of $M$ is executed. To execute a
587: labelled leaves, $L_i$, $M$'s action part may do so explicitly
588: by using the {\tt tDO($\$\%i\$$)} macro. This allows
589: the user to arbitrary order the execution of the labelled leaves.
590: \end{itemize}
591: }
592:
593: \noindent When the costing phase is over, the execution phase is started
594: by picking the lowest cost match at the root of the subject tree.
595: This will be the root of the minimum cost cover.
596: The match is executed as described above.
597:
598: The execution phase may begin before the costing phase is over.
599: This occurs when a rewrite rule matches and its cost is lower than
600: all other matches at that subtree. In this situation the rewrite rule
601: is executed immediately and the new rewritten subtree is scanned once more.
602: The prescence of rewrite rules throws a wrench in the theoretical niceties
603: of the matching algorithm. For example, there is now no guarantee
604: that the algorithm will terminate because the tree can
605: be repeatedly enlarged.
606: Rewrite rules can be dangerous. They can
607: be triggered unintentionally if the user is not careful to abort them
608: in situations where they are not wanted.
609: However, if used with care they can help to reduce the size of the \twiglang{}
610: specification.
611:
612: \section{Some Examples}
613:
614: \subsection{Expression Trees to Prefix}
615: A \twiglang{} program to print out the prefix form of expression
616: parse trees
617: is shown in Figure~\ref{trees:prefix}.
618:
619: \begin{itemize}
620:
621: \item The rules do not have cost calculations.
622: Since there are no
623: ambiguities costs are not necessary.
624:
625: \item The second rule is a topdown mode rule. This is essential in printing
626: the prefix form. If it was omitted the postfix form would be printed.
627:
628: \item Before any matching can begin {\tt \_matchinit} must be called.
629: Trees can then be processed by calling {\tt \_match} after arranging that
630: the call
631: {\tt mtGetNodes}$({\tt NULL},0)$ returns the root of the subject tree.
632: \end{itemize}
633:
634: \begin{figure}[tp]
635: \begin{tabbing}
636: \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=
637: \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\kill
638: prologue \> $\{$\\[4pt]
639: typedef \> struct node $*$\>\>{\tt NODEPTR};\\
640: \#define \> {\tt COST} \>\> int\\
641: \#define \> {\tt INFINITY}\>\> 100000\\
642: \#define \> {\tt DEFAULT\_COST} \>\> 0\\[4pt]
643: {\tt NODEPTR} Root;\\
644: struct code \> $\{$\\
645: \> char op; \> \> $/*$ null if node is a leaf $*/$\\
646: \> NODEPTR left, right;\\
647: \> char $*$id;\\
648: $\}$;\\[4pt]
649: $\}$;\\[6pt]
650: node \> nOp nIdent;\\
651: label \> lExpr;\\[6pt]
652: lExpr:\> nIdent\\
653: \>= $\{$ printf(``$\%$s'', id); $\}$;\\[4pt]
654: lExpr:\>nOp(lExpr,lExpr)\\
655: \>$\{$ {\tt TOPDOWN};$\}$\\
656: \>= $\{$\\
657: \>\>putchar($\$\$\to$op);\\
658: \>\>tDO($\$\%1\$$);\\
659: \>\>tDO($\$\%2\$$);\\
660: \>$\}$;\\[6pt]
661: insert\>$\{$\\[4pt]
662: mtValue(root)\\
663: \>{\tt NODEPTR} root;\\
664: $\{$\\
665: \>if(root$\to$op==0)\\
666: \>\>return(nIdent);\\
667: \>else\\
668: \>\>return(nOp);\\
669: $\}$\\[4pt]
670: \end{tabbing}
671: \caption{Printing Expression trees to Prefix form}
672: \end{figure}
673: \addtocounter{figure}{-1}
674: \begin{figure}[tp]
675: \begin{tabbing}
676: \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=
677: \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\kill
678: mtGetNodes(r,n)\\
679: \>{\tt NODEPTR} r;\\
680: \>int n;\\
681: $\{$\\
682: \>if(n==1)\\
683: \>\>return(r$\to$left);\\
684: \>else if(n==1)\\
685: \>\>return(r$\to$right);\\
686: \>else if(n==0 $\&\&$ r=={\tt NULL})\\
687: \>\>return(Root);\\
688: \>else return({\tt NULL});\\
689: $\}$\\[4pt]
690: mtSetNodes(r,n, c)\\
691: \>{\tt NODEPTR} r, c;\\
692: \>int n;\\
693: $\{$\\
694: \>if(n==1)\\
695: \>\>r$\to$left = c;\\
696: \>else if(n==2)\\
697: \>\>r$\to$right = c;\\
698: \>else if(n==0)\\
699: \>\>Root = c;\\
700: $\}$\\[4pt]
701: main(argc,argv)\>\>\>$/*$ called by user only $*/$\\
702: \>char $**$argv;\\
703: $\{$\\
704: \>\_matchinit();\>\>$/*$ initialize the matcher $*/$\\
705: \>$...$ {\sl get a tree and set Root to it}\\
706: \>\_match();\>\>$/*$ do the match $*/$\\
707: $\}$\\
708: $\}$\\
709: \end{tabbing}
710: \caption{(cont.) Printing Expression trees to Prefix form}
711: \label{trees:prefix}
712: \end{figure}
713:
714: \subsection{Short Circuited Boolean Evaluation}
715:
716: Short circuited boolean evaluation is naturally topdown.
717: A fragment of \twiglang{} program that implements this
718: is shown in Figure~\ref{short:circuit}.
719:
720: \begin{itemize}
721:
722: \item Labels for true and false branches are passed down to
723: rules below by putting them into the nodes. Another way of
724: accomplishing this is to use an auxillary argument stack.
725:
726: \item The rules assume that every test generates a branch to both
727: the false and true label. On conventional machines, extra branches
728: can be eliminated by recognizing that one could generate code that
729: falls through to its true or false labels. This can be done by
730: separating {\sl lTest} into {\sl lTrueTest} and {\sl lFalseTest} which
731: branches only when true and false respectively. Rules can then be
732: written to take advantage of this.
733: \end{itemize}
734:
735: \begin{figure}
736: \begin{tabbing}
737: \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=
738: \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\kill
739: prologue \> $\{$\\[4pt]
740: \#include\>``otherdefs.h''\\
741: union node $\{$\\
742: \> $...$ \>\>$/*$ definition of other node types $*/$\\
743: \> union $\{$\\
744: \>\> int operation;\\
745: \>\> {\tt NODEPTR} left, right;\\
746: \>\> {\tt LABEL} falselab, truelab;\>\>$/*$ {\tt LABEL} will defined
747: elsewhere $*/$\\
748: \>\> $...$ \>\> $/*$ other definitions relevant to {\sl test nodes} $*/$\\
749: \> $\}$ test;
750: \> $...$ \>\>$/*$ definition of other node types $*/$\\
751: $\}$;\\[4pt]
752: $\}$;\\[6pt]
753: node\>nAnd nOr nNot nLess;\\
754: label\>lTest lExpr;\\[6pt]
755: lTest:\>nAnd(lTest,lTest)\\
756: \>$\{$ {\tt TOPDOWN}; $\}$\\
757: \>=$\{$\\
758: \>\>$\$1\$\to$test.falselab = $\$\$\to$test.falselab;\\
759: \>\>$\$1\$\to$test.truelab = getlabel();\\
760: \>\>tDO($\$\%1\$$);\\
761: \>\>printlab($\$1\$\to$test.truelab);\\
762: \>\>$\$2\$\to$test.falselab = $\$\$\to$test.falselab;\\
763: \>\>$\$2\$\to$test.truelab = $\$\$\to$test.truelab;\\
764: \>\>tDO($\$\%2\$$);\\
765: \>$\}$;\\[4pt]
766: lTest:\>nOr(lTest,lTest)\\
767: \>$\{$ {\tt TOPDOWN}; $\}$\\
768: \>=$\{$\\
769: \>\>$\$1\$\to$test.falselab = getlabel();\\
770: \>\>$\$1\$\to$test.truelab = $\$\$\to$test.truelab;\\
771: \>\>tDO($\$\%1\$$);\\
772: \>\>$\$2\$\to$test.truelab = $\$\$\to$test.truelab;\\
773: \>\>$\$2\$\to$test.falselab = $\$\$\to$test.falselab;\\
774: \>\>tDO($\$\%2\$$);\\
775: \>$\}$;\\
776: \end{tabbing}
777: \caption{Short circuited Boolean Evaluation}
778: \label{short:circuit}
779: \end{figure}
780: \addtocounter{figure}{-1}
781: \begin{figure}[tp]
782: \begin{tabbing}
783: \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=
784: \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\kill
785: lTest:\>nNot(lTest)\\
786: \>$\{$ {\tt TOPDOWN}; $\}$\\
787: \>=$\{$\\
788: \>\>$\$1\$\to$test.truelab = $\$\$\to$test.falselab;\\
789: \>\>$\$1\$\to$test.falselab = $\$\$\to$test.truelab;\\
790: \>\>tDO($\$\%1\$$);\\
791: \>$\}$;\\[4pt]
792: lTest:\>nLess(lExpr,lExpr)\>\>$/*$ {\sl not topdown anymore } $*/$\\
793: \>=$\{$ $...$ {\sl generate code for comparison } $...$ $\}$;\\
794: \end{tabbing}
795: \caption{(cont.) Short circuited Boolean Evaluation}
796: \end{figure}
797:
798: \subsection{Code Generation}
799:
800: Figure~\ref{example} is an
801: example of \twiglang{} program that
802: can be used to generate VAX code for
803: the subtract instruction:
804:
805: \begin{figure}
806: \begin{tabbing}
807: \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=
808: \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\kill
809: prologue \> $\{$\\
810: $/*$ otherdefs.h will have a type definition for {\tt NODEPTR} $*/$\\
811: \#include\>``otherdefs.h''\\
812: \#define\>{\tt TEMP\_COST}\>\>5\\
813: \#define\>{\tt SUB\_COST}\>\>30\\
814: \#define\>{\tt DEC\_COST}\>\>10\\
815: NODEPTR \> gettemp();\\
816: $\}$;\\[4pt]
817: node \> long constant sub;\\
818: label \> operand temp;\\[4pt]
819: operand: \> long;\>\>\>$/*$ rule 1 $*/$\\
820: operand:\>constant;\>\>\>$/*$ rule 2 $*/$\\
821: operand:\>temp;\>\>\>$/*$ rule 3 $*/$\\[4pt]
822: temp:\>operand\>\>\>$/*$ rule 4 $*/$\\
823: \>$\{$ cost = {\tt TEMP\_COST}+$\$\%1\$\to$cost;$\}$\\
824: \>$=\{$\\
825: \>\>NODEPTR t = gettemp();\\
826: \>\> emit(``mov'', $\$\$$, t, 0);\\
827: \>\>return(t);\\
828: \>$\}$;\\[4pt]
829: operand:\>sub(operand,operand)\>\>\>$/*$ rule 5 $*/$\\
830: \>$\{$cost = $\$\%1\$\to$cost + $\$\%2\$\to$cost+{\tt SUB\_COST};$\}$\\
831: \>=$\{$\\
832: \>\>NODEPTR t = gettemp();\\
833: \>\>emit(``sub'', $\$1\$$, $\$2\$$, t, 0);\\
834: \>\>return(t);\\
835: \>$\}$;\\[4pt]
836: temp:\>sub(temp,constant)\>\>\>$/*$ rule 6 $*/$\\
837: \>$\{$\\
838: \>\>if(value($\$2\$$)==1)\\
839: \>\>\>cost = $\$\%1\$\to$cost+{\tt DEC\_COST};\\
840: \>\>else ABORT;\\
841: \>$\}$=$\{$\\
842: \>\>emit(``dec'', $\$1\$$, 0);\\
843: \>\>return($\$1\$$);\\
844: \>$\}$;
845: \end{tabbing}
846: \caption{Small Code Generation Example}
847: \label{example}
848: \end{figure}
849:
850: \begin{itemize}
851: \item Rules 3 and 4 form a loop.
852: The potential loop: temp$\to$operand$\to$temp$\to$operand$...$ is
853: broken
854: by the matcher recognizing that the cost of the second match of temp
855: does not cost less than the first match of temp.
856:
857: \item In the cost clause of rule 5, the cost is the sum of the
858: leaves plus the cost of the subtract instruction.
859: The action clause emits code to add the two operands
860: and leave the result in a temporary location. The temporary is
861: returned as a substitution for the subject tree.
862:
863: \item Rule 6 handles a special case where the left operand is
864: already in a temporary and the constant is one. In this case, the
865: temporary is directly decremented and returned as the new tree.
866: \end{itemize}
867:
868: \section{How to run \Twigcomp{}}
869:
870: Once a user has written the specification, \twigcomp{} is used to compile it
871: into a subroutine.
872: Figure~\ref{overview} gives an overview of how \twigcomp{} does this.
873: In Figure~\ref{overview}, rounded boxes indicate files and sharp
874: cornered boxes
875: are processing programs. The user creates the \twiglang{} specification which
876: is represented by the top box. The specification file must have the
877: suffix ``{\tt .mt}''. Let's say we have one called {\tt arbor.mt}.
878: To invoke the \twigcomp{} compiler we type:
879:
880: \begin{verse}
881: {\bf twig} [--{\bf w$xx$}] arbor.mt
882: \end{verse}
883:
884: The proprocessor reads {\tt arbor.mt} and if there are no errors
885: will create two files: {\tt symbols.h} and {\tt walker.c}.
886: To build {\tt walker.c}, \twigcomp{} uses a template file called {\tt walker.$xx$}
887: where $xx$ is the argument of the optional --{\bf w} flag.
888: If the flag is omitted then $xx$ defaults to ``{\tt c1}''.
889:
890: {\tt Symbols.h} and {\tt walker.c}
891: can then compiled and linked with the other modules of the
892: user's program. These other modules should provide the
893: Tree and Cost ADTs described above.
894: {\tt Walker.h} is an auxilliary include file that is referenced by {
895: \tt walker.c}.
896: A typical compile command is:
897:
898: \begin{verse}
899: {\bf cc} -- {\bf I\sl include\_dir} {\bf -c} walker.c
900: \end{verse}
901:
902: The --{\bf I\sl include\_dir} argument causes {\bf cc} to look for
903: walker.h in {\sl include\_dir}. The exact value of {\sl include\_dir}
904: will depend on where the files are stored at your site.
905: The tree matcher is initialized by calling {\tt \_matchinit} and matching
906: can begin by calling {\tt \_match} from
907: the user program.
908:
909: \section{Organization of the \Twigcomp{} Compiler}
910: The preprocessor is written completely in C and YACC.
911: It can be roughly broken up into four modules:
912: \begin{itemize}
913: \item Lexer --- The lexer performs lexical analysis.
914: The input is tokenized by this module and passed to the parser
915: as required. Identifiers are looked up in the symbol table
916: and space is reserved for them if they have never been encountered.
917: \item Parser --- This part is written in YACC and recognizes the
918: \twiglang{} language. When a language construct is recognized actions
919: are invoked. Declarations, prologues and inserts will invoke symbol
920: table operations when recognized. Rules are broken up and stored
921: in the symbol table too. The tree patterns in the rules are unflattened
922: and passed to the machine builder.
923: \item Machine Builder --- This module takes tree patterns and incrementally
924: builds an internal representation of the matching automaton.
925: The builder takes this
926: tree and then enumerates each of the possible paths from the root to the
927: leaves. These paths are treated as strings and a string matching automaton
928: is built as described in \cite{fgrep} and \cite{treematch}.
929:
930: Inside the builder the partially built machine is kept in a linked list
931: form. Each machine state is a linked list and each node represents
932: a state transition. When the tables for the machine are written out into {
933: \tt walker.c}, the linked list structure is
934: translated into a table of sixteen bit integers.
935: Each transistion is stored as two integers. The first is
936: the integer corresponding to the symbol that causes the transistion and the
937: second is the index of the next state in the table.
938:
939: \item Symbol Table --- The central data structure in this symbol table is
940: the hash table.
941: Each entry in the hash table is a bucket --- a linked list of symbol table
942: entries. There is no rehashing. All colliding items are placed
943: in the linked list.
944: This is a simple
945: and adequately efficient arrangement. Depending on the type of symbols, the
946: entry may point to other data structures. Entries for $label\_id$s point to
947: lists of trees. $Node\_id$ entries record the integers that have been
948: associates with them. Other entries may point to data structures holding a
949: textual representation of C code that forms action and cost parts.
950: \end{itemize}
951:
952: Here is a list of the files that form the \twigcomp{} compiler's source
953: and their approximate function.
954: \begin{itemize}
955: \item {\tt common.h} --- This file contains data definitions for how
956: trees are represented in the parser. It also contains type definitions for
957: external functions.
958: \item {\tt code.h} --- This file contains definitions for how code fragments
959: are stored.
960: \item {\tt sym.h} --- This file defines the major data structures in the
961: symbol table.
962: \item {\tt machine.h} --- This file defines the how the matching automaton
963: is represented internally.
964: \item {\tt twig.y} --- This is the parser.
965: \item {\tt sym.c} --- This is the symbol table manager.
966: \item {\tt path.c} --- The path string enumerator.
967: \item {\tt machine.c} --- The machine builder.
968: \item {\tt lex.c } --- The lexer.
969: \item {\tt tree.c}, {\tt io.c}, and {\tt code.c} --- Miscellaneous
970: routines for manipulating trees, performing I/O and manipulating
971: code fragments.
972: \end{itemize}
973:
974: \section{Performance}
975:
976: So far, the only experience we have with \twiglang{}
977: is a VAX code generator written
978: for the pcc2 compiler. The \twiglang{} code generator is 25\% faster than
979: the original code generator.
980: The maintainability and modifiability of the code generator has
981: improved. For example adding the indexed addressing mode
982: into the code generator took only a few hours. The target code quality
983: is just as good as the original code generator.
984:
985: The \twigcomp{} table
986: generation algorithm is fast. For the VAX machine description which
987: consist of 109 rules, the generation time was 4.2 seconds on a 780.
988: The generation time could be increased
989: by two orders of magnitude
990: before other table driven system can compete with \twigcomp{} on this basis.
991: Furthermore, the tables are small. The VAX description is only 7.5K
992: and the text space for the walker is 30K.
993: Again this is much smaller than those of other table driven systems.
994:
995: \Twiglang{} is currently a research tool. Several things can be done to improve
996: \twiglang{}'s performance.
997: \begin{itemize}
998: \item \Twiglang{} does a lot of procedure calls. Every machine transition
999: require at least one procedure call. Contrast this to YACC where a machine
1000: transition is done in line.
1001: \item \Twiglang{} performs a very expensive closure operation with respect to
1002: unit rules. A unit rule is the analogue of a unit production and has the
1003: form:
1004:
1005: \begin{verse}
1006: $label_1$: $label_2$;
1007: \end{verse}
1008:
1009: \noindent This closure is done at run time.
1010: It requires many procedure calls and manipulates complex data structures.
1011: We are looking at ways of doing this
1012: at table generation time or to hash the results of the closures so that
1013: redundant calculations can be avoided. The problem with doing them
1014: in the table generator is the possible explosion in its
1015: running time.
1016:
1017: \item Many of the cost parts for the rules are similar and hence some
1018: tests are performed and recalculated many times
1019: on a node. \Twigcomp{} should be clever
1020: enough to factor out some of these tests and do them only once.
1021: However, the information required for \twigcomp{} to do this is not easily
1022: available. It is hidden in the C code fragments.
1023:
1024: \item The current compact
1025: representation of the matching automaton
1026: require linear searching except at the
1027: start state where the large branching factor compelled us to use an
1028: array scheme. Using another representation would speed up the
1029: matcher. For example, implementing the VAX description as an array is
1030: estimated to use about 50K bytes but may provide
1031: performance improvement.
1032:
1033: \end{itemize}
1034: \bibliography{mybib}
1035: \bibliographystyle{plain}
1036: \end{document}
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.