researchv10no/cmd/twig/doc/manual/twig0.tex - view

File: [Research Unix] / researchv10no / cmd / twig / doc / manual / twig0.tex
Revision 1.1.1.1 (vendor branch): download - view: text, annotated - select for diffs
Tue Apr 24 17:21:35 2018 UTC (8 years, 1 month ago) by root
Branches: belllabs, MAIN
CVS tags: researchv10, HEAD

researchv10 Norman

\documentstyle[titlepage,fullpage,12pt,clbiba]{article} \begin{document} \newcommand{\twigcomp}{{\tt twig}} \newcommand{\Twigcomp}{{\tt Twig}} \newcommand{\twiglang}{{\it twig}} \newcommand{\Twiglang}{{\it Twig}} \title{Twig Reference Manual} \date{ \begin{minipage}{.8\textwidth} \normalsize \noindent \Twiglang{} is a language for manipulating trees. A \twiglang{} program consists of a set of pattern-action rules together with associated declarations. Patterns describe trees to be matched. Actions calculate costs, perform tree manipulations and other functions such as emitting code. A \twiglang{} program is translated by the \twigcomp{} compiler into subroutines and tables in a host language. In the current implementation, the host language is C.\\[6pt] A \twiglang{} program manipulates trees by first finding a minimum cost covering of the input tree. The actions of the rules whose pattern parts composes the covering is then executed. The minimum cost covering is determined using dynamic programming. This technique naturally resolves many ambiguities that may be in the specifications.\\[6pt] The prime purpose of \twiglang{} is to create tree manipulation programs. One interesting application of tree manipulation is code generation and \twiglang{} has been used to implement a code generator for the pcc2 compiler on the VAX. \end{minipage}} \author{Steven W.K. Tjiang} \maketitle \tableofcontents\newpage \section{Introduction} \Twiglang{} is a language for writing tree manipulation programs. A \twiglang{} program does this by using a matcher to find a minimal cost covering of a tree. The covering is composed of user supplied templates. Trees are manipulated by executing the actions associated with the templates of their coverings. These actions may rewrite the tree, update other data structures, or generate output. Figure~\ref{overview} gives an overview of the \twiglang{} system. The round boxes in the figure indicates files and the sharp cornered boxes are processing programs. The user writes the \twiglang{} specification file. This file is then compiled by the \twigcomp{} compiler into host language files. In the current version of \twiglang{}, the host language is C. The host language files are then compiled and linked with other user supplied files. The resulting program represented by the large box at the bottom of figure~\ref{overview} is the tree manipulation program. \begin{figure}[hp] \input{pic} \caption{The {\bf Twig } system} \label{overview} \end{figure} The details of this box will be discussed later. \section{Motivation} Although \Twiglang{} is a language for manipulating trees it was originally motivated by pattern directed code generation. In many compilers, the typical structure is a {\sl front end} and a {\sl back end}. The front end reads the input source program, checks its syntax and contextual conditions, and builds an intermediate representation of the program. The language used to do this representation is often called the {\it intermediate representation} and will be abbreviated as IR. The back end takes the IR and translates it into the target machine code. That isn't the whole story but it's close. Both ends may be sets of programs rather than one program. There may also be a {\it middle} that performs transformations on the IR. The intention of these transformations is to improve the target code that comes out of the compiler. For this discussion, we can ignore the middle. IRs can vary greatly from compiler to compiler. However the IRs are implemented, they are encodings of graphs. Usually three types of graphs are evident\cite{dragon}. There is a flowgraph whose nodes are {\sl basic blocks} and the edges corresponds to transfer of control between the basic blocks. Basic blocks are not featureless. They are graphs themselves. Some compilers will represent them as a forest of {\it directed acyclic graphs} or DAGs. The nodes are data values and operations. Since operations yields values, one can think of nodes as corresponding to values. An edge exists from node $a$ to node $b$ if and only if the value corresponding to $a$ depends on the value corresponding to $b$. Other compilers will use trees instead of DAGs. Trees are not as general as DAGs and these compilers overcome that by using some type of labelling or copying. Writing a front end usually involves writing a recognizer for the source language, designing a symbol table, and determining an error recovery strategy. What has made this process easy are parser generators like YACC\cite{yacc}. They separate out the task of recognizing the source language from the other details. This makes the task intellectually more manageable. The compiler implementor now specifies the source language via grammar productions. Symbol table manipulations, and IR construction rules are attached to productions and activated when a source language construct is recognized. The front end is then {\it automatically} generated by processing the specification with a parser generator. The result is a high quality front end whose performance is close to that of a hand-crafted front end. Providing the same type of facilities for automatically building back ends has been much more difficult. Extending current front end parsers to do back ends has been workable but not entirely successful. The basic idea is that the back end implementor writes a {\it machine grammar} for the IR and code generation occurs as actions that are executed when patterns are found in the IR\cite{gana,henry}. Unfortunately, these grammars are large and ambiguous. Using parsers to handle them have usually met with many combinatorial problems \cite{henry}. Furthermore, these code generators are slow although their output is often of high quality. The software engineering advantages one see with parsers in the front end are not so evident in this circumstance. Instead of having to deal with the complexities of code generation, the implementor now has to deal with the complexities of the very large machine grammar. Another approach at automatically building back ends is to start with known good algorithms for generating code. From these algorithms, we can build an abstract interpreter and then specifications can be written to drive them. We hope that by doing this we can derive a much more natural code generation specification technique. \Twiglang{} is one possible result of this approach. The algorithms that form the basis of \twiglang{} are presented in \cite{optimalcode} and \cite{treematch}. \Twiglang{} specifications are much less complex than parser based specifications. Ambiguities are handled naturally and do not complicate the specifications. In \cite{optimalcode}, it is shown how optimal code can be generated for an expression tree given a description of a machine's instruction set. This description is in term of tree templates. The algorithm first determines a minimal cost covering of an expression tree using dynamic programming. The covering is composed from the templates in the description. The code generated is then the instructions represented by the templates in the covering. This algorithm was shown to generate the shortest code sequences for an abstract machine and its running time was shown to be linear in the number of nodes in the expression tree in \cite{optimalcode}. This is all very well in theory but what about in practice. Code length is not always the quantity that one wishes to minimize. One might wish to minimize the number of memory accesses or machine cycles executed by the generated code. Fortunately, the algorithm can also minimize these quantities. In fact, it can minimize any strictly increasing cost function. Let $I$ denote a machine instruction and $I_0I_1I_2...I_n$ denote a sequence of instructions. Then a cost function $f$ is strictly increasing if and only if $f(I_0I_1I_2...I_n) < f(I_0I_1I_2...I_nI_{n+1})$ for all instruction sequences $I_0I_1I_2...I_n$ and instruction $I_{n+1}$. Another shortcoming in \cite{optimalcode} is that no efficient algorithm is given for matching the templates in the tree. This is where \cite{treematch} comes in. In that paper, two algorithms are presented for tree matching. \Twiglang{} combines the work presented in the two papers. \Twiglang{} differs from the algorithm as presented in \cite{optimalcode} in the following ways: \begin{itemize} \item It provides a convenient specification language for the templates. \item Tree templates in \cite{optimalcode} were labelled with either {\sl Register} or {\sl Memory}. This label being the storage class that the instruction corresponding to the template would put a result. \Twiglang{} generalize tree labels so that they can be any symbol. \item Instead of associating an instruction with each template, \twiglang{} associates an action. \item In \cite{optimalcode}, code is generated by a bottom up traversal of the minimum cost covering. \Twiglang{} allows a top down traversal of the cover and also the ability for the actions to rewrite parts of the tree. \item The algorithm in \cite{optimalcode} determines the evaluation order which gives the optimal code. \Twiglang{} does not do this. The evaluation order determination was omitted based on the conjecture that most of the time, simple left to right evaluation of the leaves will yield near optimal results. In situations where a nonstandard evaluation order is required, a new \twiglang{} pattern could be added that codes the order explicitly (see Section~\ref{exec:phase}). \end{itemize} In this manual, a brief discussion will be given of the \twiglang{} language and how \twiglang{} programs can be incorporated with C. Many statements have been made in this manual without proof. If the reader wishes to pursue it further, more details about the algorithms will be presented in an upcoming paper, \cite{aho:gana:tjiang}. \section{The \Twiglang{} Language} \subsection{Lexical Issues and Conventions} Currently the following identifiers are reserved keywords of \twiglang{}: \begin{tabbing} \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\kill action \> cost \> insert \\ label \> node \> prologue \end{tabbing} \noindent ;:(),= are special characters. All blanks, tabs, formfeeds and newlines are ignored by \twiglang{} but they may not appear inside identifiers and numbers. Identifiers are nonempty strings made of letters, digits or underscores and starting with a letter. Numbers are nonempty string of digits. Code fragments or action parts are enclosed in braces. Inside code fragments, C lexical rules must be obeyed except that strings of the form $\$...\$$ that are not inside C strings have special meaning to \twiglang{}. In the following sections, $id$ denotes an identifier; $int_1$, $int_2$ denotes numbers; $ccode$ denotes C code fragments; $...$ indicates repetition of the previous item; $[...]$ indicates that $...$ is optional. The input to the \twiglang{} program will be referred to as the subject tree. \subsection{Rules} The fundamental unit of a \twiglang{} program is a rule. \begin{verse} $label\_id$ {\bf :} $tree\_pattern$\quad {\bf [} $cost$ {\bf ]} {\bf [ =} $action$ {\bf ]} \end{verse} \noindent Intuitively, the pattern is used to match a subtree. The $cost$ code fragment is then evaluated. The resulting cost is recorded by the matcher for use in dynamic programming. The $action$ is executed if the rule is part of the least cost cover of the tree. The details of pattern matching will be discussed in Section~\ref{pattern}. If the $cost$ part is missing, \twigcomp{} will insert default code that returns the special value, {\tt DEFAULT\_COST}. A missing $action$ part indicates that nothing will be done when a match is found. \subsection{Tree Patterns} \label{TreePatterns} Tree patterns are specified in prefix parenthesized form and can be described by the following BNF: \begin{verse} $tree\_pattern\to node\_id\;\vert\;label\_id$\\ $tree\_pattern\to node\_id \;{\bf (}\; tree\_list \;{\bf )}$\\ $tree\_list \to tree \;{\bf ,}\; tree\_list \;\vert\; tree$\\ \end{verse} \begin{figure} \unitlength 0.1in \begin{center} \begin{picture}(25,16) \put(7,0){\framebox(4,3){\tt id}} \put(20,0){\framebox(5,3){\tt const}} \put(13,6){\framebox(5,3){\tt plus}} \put(6,12){\framebox(5,3){\tt plus}} \put(0,6){\framebox(4,3){\tt id}} \put(11,3){\line(2,3){2}} \put(20,3){\line(-2,3){2}} \put(13,9){\line(-2,3){2}} \put(4,9){\line(2,3){2}} \end{picture} \end{center} \vspace{0.5in} \begin{verse} \begin{center} \tt plus(id, plus(id, const)) \end{center} \end{verse} \caption{A Tree and its prefixed form} \label{treepattern} \end{figure} \noindent Figure~\ref{treepattern} gives an example of a tree pattern and its prefix parenthesized form. There are two types of symbols: $node\_id$\/s and $label\_id$\/s. $Node\_id$\/s are used to denote internal nodes and leaves. $Label\_id$\/s label tree patterns and are analogous to nonterminals in context free grammars. For example, the following \twiglang{} rules without their action parts describe simple expression trees with the {\tt plus} operator. \begin{verse} {\tt\begin{tabbing} expr: \= plus(expr, expr)\\ expr: \> identifier\\ expr: \> constant \end{tabbing}} \end{verse} \noindent Here, {\tt identifier} and {\tt constant} are $node\_id$\/s representing leaves, and {\tt plus} is a $node\_id$ representing an internal node whereas {\tt expr} is a $label\_id$. Pattern leaves that are $label\_id$\/s are called {\sl labelled leaves}. In the first rule, both leaves of the pattern are labelled. \Twigcomp{} associates an integer with each node symbol and label symbol. These integers are used by the \twiglang{} pattern matcher as encodings for the symbols. As the matcher traverses the tree, a user supplied subroutine is called to provide an integer corresponding to each node. \subsection{Costs} To increase the flexibility of representing costs, the tree matcher views costs as an {\sl abstract data type} or ADT. For example, costs may be represented as an integer or as a vector of integers with each element representing the cost of a specific resource. A cost ADT suitable for \twiglang{} must have the following four definitions: \begin{itemize} \item {\tt COST} is a C type. Although the proper functioning of the tree matcher does not depend on the internal details of the {\tt COST} type, it must have the type information for storage allocation purposes. \item {\tt INFINITY} is the maximum attainable cost value. The matcher uses {\tt INFINITY} to initialize its data structures. \item {\tt DEFAULT\_COST} is the cost value returned by rules without a cost part. \item {\tt COSTLESS} is a function of two cost values. It must return true if and only if the first argument is less than the second. \end{itemize} \subsection{Trees} As with costs, trees are treated as an ADT. Here, using an ADT is even more important because trees come in a variety of shapes and representations. \Twiglang{} would be overly complicated if it had to know any more than the fundamental properties of trees. Thus, \twiglang{} treats trees as an acyclic directed graph of almost featureless nodes with one distinguished node, the root. Each node has only one feature and that is an integer corresponding to the $node\_id$\/s discussed above. To provide this view to \twiglang{}, the user must provide the following definitions and subroutines. \begin{itemize} \item {\tt NODEPTR} is the type of a node. The actual details of the {\tt NODEPTR} are irrelevant and \twigcomp{} uses this definition only for storage allocation and declaration purposes. \item {\tt NULL} is used to denote a null {\tt NODEPTR} value. In the current implementation this is the same {\tt NULL} used by the C standard I/O package and need not be defined by the user. This restricts the {\tt NODEPTR} representation to be a pointer. \item {\tt mtGetNodes$(r,n)$} returns the $n$th child of $r$ where $r$ is a {\tt NODEPTR}. The leftmost child is {\tt mtGetNodes}$(r,1)$. If $n > \hbox{degree}(r)$ then the function should return {\tt NULL}. \Twigcomp{} expects the expression {\tt mtGetNodes}$({\tt NULL}, 0)$ to denote the root node of the subject tree. \item {\tt mtValue$(r)$} returns the integer associated with $r$ (See section~\ref{TreePatterns}). \item {\tt mtSetNodes$(r,n,c)$} replaces the $n$th child of $r$ with $c$. This routine may be used to replace whole subtrees with others. {\tt mtSetNodes}$({\tt NULL},0,c)$ is used by \twigcomp{} to set the root of the subject tree to $c$. \end{itemize} \subsection{Prologue and Inserts} \begin{verse} {\tt prologue} $ccode$; \end{verse} \noindent signals to \twigcomp{} that $ccode$ should be inserted at the beginning of the output source file. $Ccode$ should contain definitions relevant to the C code in rules that are defined later in the \twiglang{} source file. \begin{verse} {\tt insert} $ccode$; \end{verse} \noindent causes $ccode$ to be placed into the source file. There can be multiple inserts and they will be placed into the source file in the order that they appear. \subsection{Declarations} All \twiglang{} identifiers are declared before they are used. \begin{verse} {\tt node $id[(int_1)] [= int_2] ...$;\hfil} \end{verse} \noindent A node declaration declares one or more identifier to be associated with nodes of the subject tree. The identifiers are assigned numbers by \twigcomp{} but the user can overide the assigned number by specifying $int_2$. The degree of the node identifer, the number of children, is assumed to be fixed. Normally, \twigcomp{} will deduce the degree when $id$ is used in a rule. However, the user may explicitly give the degree by specifying $int_1$. \begin{verse} {\tt label $id ...$;} \end{verse} A label declaration indicates that the following $id$'s are to be used as labels of rules. \subsection{Costs and Action code} Code fragments such as the $cost$ and $action$ clauses of a rule are specfied by enclosing C code in braces. All legal C constructs are permitted in code fragments. In addition, the following are provided for convenient access to the subject tree and internal data structures of the matcher. \begin{itemize} \item $\$\%n\$$ denotes a pointer value to the matcher data structure for the $n$th labelled leaf. Thus to access the cost value associated with that leaf, the notation $\$\%n\$\to${\tt cost} may be used. \item $\$\$$ denotes a pointer value to the root of the subject tree. \item $\$n_1.n_2.n_3.\, \hbox{...}\,.n_{k-1}.n_k\$$ denotes a pointer value to the $n_k$th child of the $n_{k-1}th$ child of the $n_{k-2}$th child of \hbox{...} of the $n_1$th child of the root of the subject tree. Each $n_i$ is a positive non-zero integer. \end{itemize} Some special constructs are available to code fragments in the cost part of the specification. The statement ``{\tt ABORT;}'' when encountered during the execution of the cost code, signals to the matcher that this pattern is to be rejected. When a ``{\tt REWRITE;}'' statement is encountered, control is returned to the matcher and the rule will become a {\sl rewrite} mode match. When the end of the cost code fragment is reached, control is returned to the matcher and the rule becomes a {\sl defer} mode match. The statement ``{\tt TOPDOWN;}'' is like ``{\tt REWRITE;}'' except that the mode of the match becomes {\sl topdown}. The meanings of these modes will be explained in Section~\ref{pattern}. Cost values are returned to the matcher by assigning to the ``{\tt cost}'' variable in the cost clause. If no assignment is made to the {\tt cost} variable then the returned cost will be {\tt DEFAULT\_COST}. Action clauses are expected to return a new tree which will replace the subject tree. This is done by returning using the ``{\tt return($new\_tree$);}'' statement. where {\tt $new\_tree$} is of type {\tt NODEPTR}. If execution reaches the end of the action clause, the matcher resumes execution and the subject tree is not modified. \section{Pattern Matcher Operation} \label{pattern} The pattern matcher operates in two phases: the costing phase and the execution phase. During the costing phase, a minimal cost covering of the subject tree is found. The execution phase invokes the actions that are associated with the patterns making up the covering. Execution phases may begin during the costing phase to execute the covering of a subtree as described in Section~\ref{exec:phase}. \subsection{The Costing Phase} \label{cost:phase} Let the patterns of the \twiglang{} specifications be interpreted as grammar productions with {\sl label\_id}s as non-terminals and {\sl node\_id}s as terminals. If we add a production $S\,\to\,\hbox{\sl label\_id}$ for all {\sl label\_id}s where $S$ is the start symbol, then a covering for a tree is analogous to a derivation for the prefixed form of the tree. Figure~\ref{gram:interp} shows some patterns and its corresponding grammar. \begin{figure} \begin{center} \begin{verse} \tt e: plus(e,e)\\ e: plus(e, plus(e,e))\\ e: id\\ e: const\\ \end{verse} \end{center} \vspace{0.3in} \begin{center}\begin{verse} $S\to e$\\ $e\to plus(e,e)$\\ $e\to plus(e,plus(e,e)$\\ $e\to id$\\ $e\to const$\\ \end{verse}\end{center} \caption{Pattern and Grammar} \label{gram:interp} \end{figure} The productions of the form $S\to\hbox{\sl label\_id}$ reflect the fact that any {\sl label\_id} may start a cover. The matcher finds the least cost cover by doing a preorder traversal of the subject tree. At the same time, it builds a skeleton tree that is structurally isomorphic to the subject tree. When a match is discovered the cost clause of the pattern is invoked to calculate the cost. Many patterns with different labels could match at any given node but only the lowest cost pattern for each label is recorded in the skeleton. When a pattern is matched, its label is then used as input to the pattern matcher so that matching of patterns with labelled leaves can begin. This process is analogous to a reduce action in bottom up parsers. \subsection{The Execution Phase} \label{exec:phase} The execution phase starts when the costing phase is complete or when a sufficiently low cost {\sl rewrite} mode rule is encountered. Let $M$ be a matching rule and $L_1, L_2, L_3, ... ,L_n$ be the labelled leaves of the matching pattern. The following procedure is used to execute the action parts of $M$. \vbox{ \vspace{2ex} \noindent {\bf Procedure \it Execute} \begin{itemize} \item If $M$ is a {\sl deferred} mode match then execution occurs by first applying {\it Execute} to the $L_i$s from left to right. That is, in the order that they appear in the prefix form of the tree. When all the $L_i$s are executed, the action part of $M$ is invoked. \item If $M$ is a {\sl rewrite } mode match then just the action part of $M$ is executed. When the action part returns the matcher deletes the skeleton corresponding to the subtree that $M$ matched and rescans the new subtree that may have been substituted by executing $M$. \item If $M$ is a {\sl topdown} mode match then only the action part of $M$ is executed. To execute a labelled leaves, $L_i$, $M$'s action part may do so explicitly by using the {\tt tDO($\$\%i\$$)} macro. This allows the user to arbitrary order the execution of the labelled leaves. \end{itemize} } \noindent When the costing phase is over, the execution phase is started by picking the lowest cost match at the root of the subject tree. This will be the root of the minimum cost cover. The match is executed as described above. The execution phase may begin before the costing phase is over. This occurs when a rewrite rule matches and its cost is lower than all other matches at that subtree. In this situation the rewrite rule is executed immediately and the new rewritten subtree is scanned once more. The prescence of rewrite rules throws a wrench in the theoretical niceties of the matching algorithm. For example, there is now no guarantee that the algorithm will terminate because the tree can be repeatedly enlarged. Rewrite rules can be dangerous. They can be triggered unintentionally if the user is not careful to abort them in situations where they are not wanted. However, if used with care they can help to reduce the size of the \twiglang{} specification. \section{Some Examples} \subsection{Expression Trees to Prefix} A \twiglang{} program to print out the prefix form of expression parse trees is shown in Figure~\ref{trees:prefix}. \begin{itemize} \item The rules do not have cost calculations. Since there are no ambiguities costs are not necessary. \item The second rule is a topdown mode rule. This is essential in printing the prefix form. If it was omitted the postfix form would be printed. \item Before any matching can begin {\tt \_matchinit} must be called. Trees can then be processed by calling {\tt \_match} after arranging that the call {\tt mtGetNodes}$({\tt NULL},0)$ returns the root of the subject tree. \end{itemize} \begin{figure}[tp] \begin{tabbing} \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\= \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\kill prologue \> $\{$\\[4pt] typedef \> struct node $*$\>\>{\tt NODEPTR};\\ \#define \> {\tt COST} \>\> int\\ \#define \> {\tt INFINITY}\>\> 100000\\ \#define \> {\tt DEFAULT\_COST} \>\> 0\\[4pt] {\tt NODEPTR} Root;\\ struct code \> $\{$\\ \> char op; \> \> $/*$ null if node is a leaf $*/$\\ \> NODEPTR left, right;\\ \> char $*$id;\\ $\}$;\\[4pt] $\}$;\\[6pt] node \> nOp nIdent;\\ label \> lExpr;\\[6pt] lExpr:\> nIdent\\ \>= $\{$ printf(``$\%$s'', id); $\}$;\\[4pt] lExpr:\>nOp(lExpr,lExpr)\\ \>$\{$ {\tt TOPDOWN};$\}$\\ \>= $\{$\\ \>\>putchar($\$\$\to$op);\\ \>\>tDO($\$\%1\$$);\\ \>\>tDO($\$\%2\$$);\\ \>$\}$;\\[6pt] insert\>$\{$\\[4pt] mtValue(root)\\ \>{\tt NODEPTR} root;\\ $\{$\\ \>if(root$\to$op==0)\\ \>\>return(nIdent);\\ \>else\\ \>\>return(nOp);\\ $\}$\\[4pt] \end{tabbing} \caption{Printing Expression trees to Prefix form} \end{figure} \addtocounter{figure}{-1} \begin{figure}[tp] \begin{tabbing} \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\= \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\kill mtGetNodes(r,n)\\ \>{\tt NODEPTR} r;\\ \>int n;\\ $\{$\\ \>if(n==1)\\ \>\>return(r$\to$left);\\ \>else if(n==1)\\ \>\>return(r$\to$right);\\ \>else if(n==0 $\&\&$ r=={\tt NULL})\\ \>\>return(Root);\\ \>else return({\tt NULL});\\ $\}$\\[4pt] mtSetNodes(r,n, c)\\ \>{\tt NODEPTR} r, c;\\ \>int n;\\ $\{$\\ \>if(n==1)\\ \>\>r$\to$left = c;\\ \>else if(n==2)\\ \>\>r$\to$right = c;\\ \>else if(n==0)\\ \>\>Root = c;\\ $\}$\\[4pt] main(argc,argv)\>\>\>$/*$ called by user only $*/$\\ \>char $**$argv;\\ $\{$\\ \>\_matchinit();\>\>$/*$ initialize the matcher $*/$\\ \>$...$ {\sl get a tree and set Root to it}\\ \>\_match();\>\>$/*$ do the match $*/$\\ $\}$\\ $\}$\\ \end{tabbing} \caption{(cont.) Printing Expression trees to Prefix form} \label{trees:prefix} \end{figure} \subsection{Short Circuited Boolean Evaluation} Short circuited boolean evaluation is naturally topdown. A fragment of \twiglang{} program that implements this is shown in Figure~\ref{short:circuit}. \begin{itemize} \item Labels for true and false branches are passed down to rules below by putting them into the nodes. Another way of accomplishing this is to use an auxillary argument stack. \item The rules assume that every test generates a branch to both the false and true label. On conventional machines, extra branches can be eliminated by recognizing that one could generate code that falls through to its true or false labels. This can be done by separating {\sl lTest} into {\sl lTrueTest} and {\sl lFalseTest} which branches only when true and false respectively. Rules can then be written to take advantage of this. \end{itemize} \begin{figure} \begin{tabbing} \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\= \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\kill prologue \> $\{$\\[4pt] \#include\>``otherdefs.h''\\ union node $\{$\\ \> $...$ \>\>$/*$ definition of other node types $*/$\\ \> union $\{$\\ \>\> int operation;\\ \>\> {\tt NODEPTR} left, right;\\ \>\> {\tt LABEL} falselab, truelab;\>\>$/*$ {\tt LABEL} will defined elsewhere $*/$\\ \>\> $...$ \>\> $/*$ other definitions relevant to {\sl test nodes} $*/$\\ \> $\}$ test; \> $...$ \>\>$/*$ definition of other node types $*/$\\ $\}$;\\[4pt] $\}$;\\[6pt] node\>nAnd nOr nNot nLess;\\ label\>lTest lExpr;\\[6pt] lTest:\>nAnd(lTest,lTest)\\ \>$\{$ {\tt TOPDOWN}; $\}$\\ \>=$\{$\\ \>\>$\$1\$\to$test.falselab = $\$\$\to$test.falselab;\\ \>\>$\$1\$\to$test.truelab = getlabel();\\ \>\>tDO($\$\%1\$$);\\ \>\>printlab($\$1\$\to$test.truelab);\\ \>\>$\$2\$\to$test.falselab = $\$\$\to$test.falselab;\\ \>\>$\$2\$\to$test.truelab = $\$\$\to$test.truelab;\\ \>\>tDO($\$\%2\$$);\\ \>$\}$;\\[4pt] lTest:\>nOr(lTest,lTest)\\ \>$\{$ {\tt TOPDOWN}; $\}$\\ \>=$\{$\\ \>\>$\$1\$\to$test.falselab = getlabel();\\ \>\>$\$1\$\to$test.truelab = $\$\$\to$test.truelab;\\ \>\>tDO($\$\%1\$$);\\ \>\>$\$2\$\to$test.truelab = $\$\$\to$test.truelab;\\ \>\>$\$2\$\to$test.falselab = $\$\$\to$test.falselab;\\ \>\>tDO($\$\%2\$$);\\ \>$\}$;\\ \end{tabbing} \caption{Short circuited Boolean Evaluation} \label{short:circuit} \end{figure} \addtocounter{figure}{-1} \begin{figure}[tp] \begin{tabbing} \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\= \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\kill lTest:\>nNot(lTest)\\ \>$\{$ {\tt TOPDOWN}; $\}$\\ \>=$\{$\\ \>\>$\$1\$\to$test.truelab = $\$\$\to$test.falselab;\\ \>\>$\$1\$\to$test.falselab = $\$\$\to$test.truelab;\\ \>\>tDO($\$\%1\$$);\\ \>$\}$;\\[4pt] lTest:\>nLess(lExpr,lExpr)\>\>$/*$ {\sl not topdown anymore } $*/$\\ \>=$\{$ $...$ {\sl generate code for comparison } $...$ $\}$;\\ \end{tabbing} \caption{(cont.) Short circuited Boolean Evaluation} \end{figure} \subsection{Code Generation} Figure~\ref{example} is an example of \twiglang{} program that can be used to generate VAX code for the subtract instruction: \begin{figure} \begin{tabbing} \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\= \hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\hspace{1in}\=\kill prologue \> $\{$\\ $/*$ otherdefs.h will have a type definition for {\tt NODEPTR} $*/$\\ \#include\>``otherdefs.h''\\ \#define\>{\tt TEMP\_COST}\>\>5\\ \#define\>{\tt SUB\_COST}\>\>30\\ \#define\>{\tt DEC\_COST}\>\>10\\ NODEPTR \> gettemp();\\ $\}$;\\[4pt] node \> long constant sub;\\ label \> operand temp;\\[4pt] operand: \> long;\>\>\>$/*$ rule 1 $*/$\\ operand:\>constant;\>\>\>$/*$ rule 2 $*/$\\ operand:\>temp;\>\>\>$/*$ rule 3 $*/$\\[4pt] temp:\>operand\>\>\>$/*$ rule 4 $*/$\\ \>$\{$ cost = {\tt TEMP\_COST}+$\$\%1\$\to$cost;$\}$\\ \>$=\{$\\ \>\>NODEPTR t = gettemp();\\ \>\> emit(``mov'', $\$\$$, t, 0);\\ \>\>return(t);\\ \>$\}$;\\[4pt] operand:\>sub(operand,operand)\>\>\>$/*$ rule 5 $*/$\\ \>$\{$cost = $\$\%1\$\to$cost + $\$\%2\$\to$cost+{\tt SUB\_COST};$\}$\\ \>=$\{$\\ \>\>NODEPTR t = gettemp();\\ \>\>emit(``sub'', $\$1\$$, $\$2\$$, t, 0);\\ \>\>return(t);\\ \>$\}$;\\[4pt] temp:\>sub(temp,constant)\>\>\>$/*$ rule 6 $*/$\\ \>$\{$\\ \>\>if(value($\$2\$$)==1)\\ \>\>\>cost = $\$\%1\$\to$cost+{\tt DEC\_COST};\\ \>\>else ABORT;\\ \>$\}$=$\{$\\ \>\>emit(``dec'', $\$1\$$, 0);\\ \>\>return($\$1\$$);\\ \>$\}$; \end{tabbing} \caption{Small Code Generation Example} \label{example} \end{figure} \begin{itemize} \item Rules 3 and 4 form a loop. The potential loop: temp$\to$operand$\to$temp$\to$operand$...$ is broken by the matcher recognizing that the cost of the second match of temp does not cost less than the first match of temp. \item In the cost clause of rule 5, the cost is the sum of the leaves plus the cost of the subtract instruction. The action clause emits code to add the two operands and leave the result in a temporary location. The temporary is returned as a substitution for the subject tree. \item Rule 6 handles a special case where the left operand is already in a temporary and the constant is one. In this case, the temporary is directly decremented and returned as the new tree. \end{itemize} \section{How to run \Twigcomp{}} Once a user has written the specification, \twigcomp{} is used to compile it into a subroutine. Figure~\ref{overview} gives an overview of how \twigcomp{} does this. In Figure~\ref{overview}, rounded boxes indicate files and sharp cornered boxes are processing programs. The user creates the \twiglang{} specification which is represented by the top box. The specification file must have the suffix ``{\tt .mt}''. Let's say we have one called {\tt arbor.mt}. To invoke the \twigcomp{} compiler we type: \begin{verse} {\bf twig} [--{\bf w$xx$}] arbor.mt \end{verse} The proprocessor reads {\tt arbor.mt} and if there are no errors will create two files: {\tt symbols.h} and {\tt walker.c}. To build {\tt walker.c}, \twigcomp{} uses a template file called {\tt walker.$xx$} where $xx$ is the argument of the optional --{\bf w} flag. If the flag is omitted then $xx$ defaults to ``{\tt c1}''. {\tt Symbols.h} and {\tt walker.c} can then compiled and linked with the other modules of the user's program. These other modules should provide the Tree and Cost ADTs described above. {\tt Walker.h} is an auxilliary include file that is referenced by { \tt walker.c}. A typical compile command is: \begin{verse} {\bf cc} -- {\bf I\sl include\_dir} {\bf -c} walker.c \end{verse} The --{\bf I\sl include\_dir} argument causes {\bf cc} to look for walker.h in {\sl include\_dir}. The exact value of {\sl include\_dir} will depend on where the files are stored at your site. The tree matcher is initialized by calling {\tt \_matchinit} and matching can begin by calling {\tt \_match} from the user program. \section{Organization of the \Twigcomp{} Compiler} The preprocessor is written completely in C and YACC. It can be roughly broken up into four modules: \begin{itemize} \item Lexer --- The lexer performs lexical analysis. The input is tokenized by this module and passed to the parser as required. Identifiers are looked up in the symbol table and space is reserved for them if they have never been encountered. \item Parser --- This part is written in YACC and recognizes the \twiglang{} language. When a language construct is recognized actions are invoked. Declarations, prologues and inserts will invoke symbol table operations when recognized. Rules are broken up and stored in the symbol table too. The tree patterns in the rules are unflattened and passed to the machine builder. \item Machine Builder --- This module takes tree patterns and incrementally builds an internal representation of the matching automaton. The builder takes this tree and then enumerates each of the possible paths from the root to the leaves. These paths are treated as strings and a string matching automaton is built as described in \cite{fgrep} and \cite{treematch}. Inside the builder the partially built machine is kept in a linked list form. Each machine state is a linked list and each node represents a state transition. When the tables for the machine are written out into { \tt walker.c}, the linked list structure is translated into a table of sixteen bit integers. Each transistion is stored as two integers. The first is the integer corresponding to the symbol that causes the transistion and the second is the index of the next state in the table. \item Symbol Table --- The central data structure in this symbol table is the hash table. Each entry in the hash table is a bucket --- a linked list of symbol table entries. There is no rehashing. All colliding items are placed in the linked list. This is a simple and adequately efficient arrangement. Depending on the type of symbols, the entry may point to other data structures. Entries for $label\_id$s point to lists of trees. $Node\_id$ entries record the integers that have been associates with them. Other entries may point to data structures holding a textual representation of C code that forms action and cost parts. \end{itemize} Here is a list of the files that form the \twigcomp{} compiler's source and their approximate function. \begin{itemize} \item {\tt common.h} --- This file contains data definitions for how trees are represented in the parser. It also contains type definitions for external functions. \item {\tt code.h} --- This file contains definitions for how code fragments are stored. \item {\tt sym.h} --- This file defines the major data structures in the symbol table. \item {\tt machine.h} --- This file defines the how the matching automaton is represented internally. \item {\tt twig.y} --- This is the parser. \item {\tt sym.c} --- This is the symbol table manager. \item {\tt path.c} --- The path string enumerator. \item {\tt machine.c} --- The machine builder. \item {\tt lex.c } --- The lexer. \item {\tt tree.c}, {\tt io.c}, and {\tt code.c} --- Miscellaneous routines for manipulating trees, performing I/O and manipulating code fragments. \end{itemize} \section{Performance} So far, the only experience we have with \twiglang{} is a VAX code generator written for the pcc2 compiler. The \twiglang{} code generator is 25\% faster than the original code generator. The maintainability and modifiability of the code generator has improved. For example adding the indexed addressing mode into the code generator took only a few hours. The target code quality is just as good as the original code generator. The \twigcomp{} table generation algorithm is fast. For the VAX machine description which consist of 109 rules, the generation time was 4.2 seconds on a 780. The generation time could be increased by two orders of magnitude before other table driven system can compete with \twigcomp{} on this basis. Furthermore, the tables are small. The VAX description is only 7.5K and the text space for the walker is 30K. Again this is much smaller than those of other table driven systems. \Twiglang{} is currently a research tool. Several things can be done to improve \twiglang{}'s performance. \begin{itemize} \item \Twiglang{} does a lot of procedure calls. Every machine transition require at least one procedure call. Contrast this to YACC where a machine transition is done in line. \item \Twiglang{} performs a very expensive closure operation with respect to unit rules. A unit rule is the analogue of a unit production and has the form: \begin{verse} $label_1$: $label_2$; \end{verse} \noindent This closure is done at run time. It requires many procedure calls and manipulates complex data structures. We are looking at ways of doing this at table generation time or to hash the results of the closures so that redundant calculations can be avoided. The problem with doing them in the table generator is the possible explosion in its running time. \item Many of the cost parts for the rules are similar and hence some tests are performed and recalculated many times on a node. \Twigcomp{} should be clever enough to factor out some of these tests and do them only once. However, the information required for \twigcomp{} to do this is not easily available. It is hidden in the C code fragments. \item The current compact representation of the matching automaton require linear searching except at the start state where the large branching factor compelled us to use an array scheme. Using another representation would speed up the matcher. For example, implementing the VAX description as an array is estimated to use about 50K bytes but may provide performance improvement. \end{itemize} \bibliography{mybib} \bibliographystyle{plain} \end{document}

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.