researchv10no/cmd/lcc/doc/overview.tex - view

File: [Research Unix] / researchv10no / cmd / lcc / doc / overview.tex
Revision 1.1.1.1 (vendor branch): download - view: text, annotated - select for diffs
Tue Apr 24 17:21:35 2018 UTC (8 years, 1 month ago) by root
Branches: belllabs, MAIN
CVS tags: researchv10, HEAD

researchv10 Norman

\documentstyle{article} \begin{titlepage} \normalsize \vspace*{.7in} \begin{center} \begin{tabular}{c} \bf\Large A Retargetable Compiler for ANSI C \\[.5in] Christopher W. Fraser \\ \em AT\&T Bell Laboratories, 600 Mountain Avenue, \\ \em Murray Hill, NJ 07974 \\[1ex] and \\[1ex] David R. Hanson \\ \em Department of Computer Science, Princeton University, \\ \em Princeton, NJ 08544 \\[.7in] Research Report CS-TR-303-91 \\[1ex] February 1991 \\[.5in] \end{tabular} \end{center} \begin{abstract} \normalsize \verb|lcc| is a new retargetable compiler for ANSI C. Versions for the VAX, Motorola 68020, SPARC, and MIPS are in production use at Princeton University and at AT\&T Bell Laboratories. With a few exceptions, little about \verb|lcc| is unusual --- it integrates several well engineered, existing techniques --- but it is smaller and faster than most other C compilers, and it generates code of comparable quality. \verb|lcc|'s target-independent front end performs a few simple, but effective, optimizations that contribute to good code; examples include simulating register declarations and partitioning switch statement cases into dense tables. It also implements target-independent function tracing and expression-level profiling. \end{abstract} \end{titlepage} \begin{document} %\bibliographystyle{abbrv} \setcounter{secnumdepth}{0} \section{Introduction} \verb|lcc| is a new retargetable compiler for ANSI C~\cite{ansi:Cstandard}. It has been ported to the VAX, Motorola 68020, SPARC, and MIPS R3000, and it is in production use at Princeton University and at AT\&T Bell Laboratories. When used with a compliant preprocessor and library, \verb|lcc| passes the conformance section of Version 2.00 of the Plum-Hall Validation Suite for ANSI~C.% \footnote{The {\tt lcc} front end and a sample code generator are available for anonymous {\tt ftp} from {\tt princeton.edu}. The file {\tt README} in the directory {\tt pub/lcc} gives details. It also describes the current availability of {\tt lcc}'s production code generators.} Other reports describe \verb|lcc|'s storage manager~\cite{hanson90}, intermediate language~\cite{fraser:hanson:91a}, code generator~\cite{fraser:sigplan89}, and register manager~\cite{fraser:hanson:92}. This report surveys the remaining features of \verb|lcc| that may interest some readers. Chief among these are its performance, some aspects of its design that support this performance, and some features for debugging and profiling user code. \section{Design} With a few exceptions, \verb|lcc| uses well established compiler techniques. The front end performs lexical, syntactic, and semantic analysis, and some machine-independent optimizations, which are described below. Both the lexical analyzer and the recursive-descent parser are hand-written. Theoretically, this approach complicates both future changes and fixing errors, but accommodating change is less important for a standardized language like ANSI~C, and there have been few lexical or syntactic errors. Indeed, less than 15~percent of \verb|lcc|'s code concerns parsing, and the error rate in that code is negligible. Despite its theoretical prominence, parsing is a relatively minor component in \verb|lcc| as in other compilers; semantic analysis, optimization, and code generation are the major components and account for most of the code and most of the errors. The target-independent front end and a target-dependent back end are packaged as single program, tightly coupled by a compact, efficient interface. The interface consists of a few shared data structures, 17 functions, and a 36-operator dag language. The functions emit function prologues, define globals, emit data, etc., and most are simple. The dag language encodes the executable code from a source program; it corresponds to the ``intermediate language'' used in other compilers, but it is smaller than typical intermediate languages. Reference~\cite{fraser:hanson:91a} describes the interface. Code generators are generated automatically from compact, rule-based specifications~\cite{fraser:sigplan89}. Some rules rewrite intermediate code as naive assembly code. Others peephole-optimize the result. They are compiled into a monolithic hard-coded program that accepts dags annotated with intermediate code, and generates, optimizes, and emits code for the target machine. Hard-coding contributes significantly to \verb|lcc|'s speed. Table~\ref{linecounts} shows the number of lines in each of \verb|lcc|'s components. The notation $h+c$ indicates $h$ lines of definitions in ``header'' files and $c$ lines of C code. The ``back-end support'' is back-end code that shared by four back ends, e.g., initialization and most of the register manager. \begin{table} \begin{center} % for v1.3 \newlength{\W}\settowidth{\W}{symbol table emitters\ } \begin{tabular}{l|rcc} & & & \it generated \\ \hfil\it component & \omit\hfil\it code\hfil& \it rules & \it code generator \\ \hline front end & 968+7847 \\[2ex] back-end support & 114+741 \\ VAX & 35+170 & 178 & 5782 \\ MIPS & 40+378 & 104 & 2966 \\ 68020 & 42+190 & 145 & 8301 \\ SPARC & 40+290 & 128 & 3888 \\[2ex] \parbox{\W}{VAX+68020+SPARC\\[-2pt] symbol table emitters} & 584 \\[2ex] \parbox{\W}{naive VAX\\[-2pt] code generator} & 67+578 \\[2ex] rule compiler & 1285 \\ \end{tabular} \end{center} \caption{Number of Lines in {\tt lcc} Components.\label{linecounts}} \end{table} Target-specific files include a configuration header file, which defines parameters like the widths and alignments of the basic datatypes, target-specific interface functions, e.g., those that emit function prologues, and code generation rules, from which the code generators are generated by the rule compiler, which is written in Icon~\cite{griswold:griswold:90}. Retargeting \verb|lcc| requires involves writing these three back-end components, which vary from 377 to 522 lines in existing back ends. In practice, new back ends are implemented by writing new rules and editing copies of an existing configuration and set of interface functions. All of \verb|lcc|'s production back ends use the technology summarized above and detailed in Reference~\cite{fraser:sigplan89}. The interface between the front and back end does not depend on this technology; other back ends that conform to the interface specification can be used. For example, Reference~\cite{fraser:hanson:91a} details a hand-written code generator that emits naive VAX code. While \verb|lcc| uses well established techniques, it uses some of their more recent incarnations, each of which contributes to \verb|lcc|'s efficiency as described below. \subsection{Lexical Analysis} The design of the input module and of the lexical analyzer and judicious code tuning of the lexical analyzer contribute to \verb|lcc|'s speed. Input and lexical analysis use variations of the design described in Reference~\cite{waite86}. Since the lexical analyzer is the only module that inspects every input character, the design avoids extraneous per-character processing and minimizes character movement by scanning tokens directly out of a large input buffer. Input is read directly from the operating system into a 4096-character buffer as depicted in Figure~\ref{fig:buffering}a, and \verb|cp| and \verb|limit| delimit the unscanned portion of the buffer. The next token is scanned by advancing \verb|cp| across white space and switching on the first character of the token, \verb|*cp|. \verb|cp| is advanced as the token is recognized. \begin{figure} \begin{center} \setlength{\unitlength}{10pt} \begin{picture}(26.3,21) \put(0,2){ \begin{picture}(26.3,7) \thicklines \put( 0, 3){\framebox(26.3, 2)[r]{\tt \char`\\n}} \thinlines \put( 5, 3){\line(0,1){2}} \put(25, 3){\line(0,1){2}} \put( 3, 1){\vector(0,1){2}}\put(3, 0){\makebox(0,1){\tt cp}} \put(25.7, 1){\vector(0,1){2}}\put(25.7, 0){\makebox(0,1){\tt limit}} \put(22, 1){\vector(0,1){2}}\put(22, 0){\makebox(0,1){\tt fence}} \put(15, 6){\makebox(0,0){4096 characters}} \put( 5, 5.5){\line(0,1){1}} \put(25, 5.5){\line(0,1){1}} \put(11, 6){\vector(-1,0){6}} \put(19, 6){\vector( 1,0){6}} \end{picture} } \put(13.15,0){\makebox(0,1){(b) After a \tt read}} \put(0,14){ \begin{picture}(26.3,7) \thicklines \put( 0, 3){\framebox(26.3, 2)[r]{\tt \char`\\n}} \thinlines \put( 5, 3){\line(0,1){2}} \put(25, 3){\line(0,1){2}} \put(10, 1){\vector(0,1){2}}\put(10, 0){\makebox(0,1){\tt cp}} \put(25.7, 1){\vector(0,1){2}}\put(25.7, 0){\makebox(0,1){\tt limit}} \put(22, 1){\vector(0,1){2}}\put(22, 0){\makebox(0,1){\tt fence}} \put(15, 6){\makebox(0,0){4096 characters}} \put( 5, 5.5){\line(0,1){1}} \put(25, 5.5){\line(0,1){1}} \put(11, 6){\vector(-1,0){6}} \put(19, 6){\vector( 1,0){6}} \end{picture} } \put(13.15,12){\makebox(0,1){(a) While ${\tt cp} < {\tt fence}$}} \end{picture} \end{center} \caption{Input Buffering.\label{fig:buffering}} \end{figure} Newlines, denoted by \verb|\n|, cannot occur within C tokens, which explains the newline at \verb|*limit| shown in Figure~\ref{fig:buffering}. This newline terminates a scan for any token so a separate, per-character test for the end of the buffer is unnecessary. When a newline is encountered, an input module function is called to refill the input buffer, if necessary, and to increment the line number. ANSI~C stipulates a maximum line length of no less than 509, but few compilers insist on a specific limit. Tokens, however, can be limited to 32 characters; string literals are an exception, but they are handled as a special case. In general, an input buffer ends with a partial token. To insure that an entire token lies between \verb|cp| and \verb|limit|, the end of the buffer is moved to the memory locations {\em preceding} the buffer whenever \verb|cp| passes \verb|fence|. Doing so concatenates a partial token with its tail after the next read as shown in Figure~\ref{fig:buffering}b. Testing if \verb|cp| has passed \verb|fence| is done for each token after \verb|cp| is advanced across white space. The important consequence of this design is that most of the input characters are accessed by \verb|*cp| and many are never moved. Only identifiers (excluding keywords) and string literals that appear in executable code are copied out of the buffer into permanent storage. Reference~\cite{waite86}'s algorithm moves partial {\em lines} instead of partial tokens and does so after scanning the {\em first} newline in the buffer. But this operation overwrites storage before the buffer when a partial line is longer than a fixed maximum. The algorithm above avoids this problem, but at the per-token cost of comparing \verb|cp| with \verb|fence|. Instead of actually using \verb|cp| as suggested above, \verb|cp| is copied to the register variable \verb|rcp| upon entry to the lexical analyzer, and \verb|rcp| is used in token recognition. \verb|rcp| is assigned to \verb|cp| before the lexical analyzer returns. Using \verb|rcp| improves performance and makes scanning loops compact and fast, e.g., white space is elided by \begin{verbatim} while (map[*rcp]&BLANK) rcp++; \end{verbatim} \verb|map[c]| is a mask that classifies character \verb|c| as suggested in Reference~\cite{waite86}; e.g., \verb|map[c]&BLANK| is non-zero if \verb|c| is a white-space character (but not a newline). \verb|lcc| generates four VAX instructions for the body of this loop: \begin{verbatim} jbr L142 L141: incl r11 L142: cvtbl (r11),r5 bicl3 $-2,_map[r5],r5 jneq L141 \end{verbatim} \verb|rcp| is register \verb|r11|. Some optimizing compilers can make similar improvements locally, but not across potentially aliased assignments and calls to other, irrelevant functions. Keywords are recognized by a hard-coded decision tree, e.g., \begin{verbatim} case 'i': if (rcp[0] == 'f' && !(map[rcp[1]]&(DIGIT|LETTER))) { cp = rcp + 1; return IF; } if (rcp[0] == 'n' && rcp[1] == 't' && !(map[rcp[2]]&(DIGIT|LETTER))) { cp = rcp + 2; return INT; } goto id; \end{verbatim} \verb|IF| and \verb|INT| are defined as the token codes for the keywords \verb|if| and \verb|int|, respectively, and \verb|id| labels the code that scans identifiers. This code is generated automatically by a 50-line C program and included in the lexical analyzer during compilation. The VAX code generated for this fragment follows; again, \verb|r11| is \verb|rcp|. \begin{verbatim} L347: cmpb (r11),$102 jneq L348 cvtbl 1(r11),r5 bicl3 $-13,_map[r5],r5 jneq L348 addl3 $1,r11,_cp movl $77,r0 ret L348: cmpb (r11),$110 jneq L226 cmpb 1(r11),$116 jneq L226 cvtbl 2(r11),r5 bicl3 $-13,_map[r5],r5 jneq L226 addl3 $2,r11,_cp movl $5,r0 ret \end{verbatim} Thus, the keyword \verb|int| is recognized by less than a dozen instructions, many less than are executed when a table is searched for keywords, even if perfect hashing is used. As in other compilers~\cite{aho:sethi:ullman:86}, strings that must be saved (identifiers and string literals) are hashed into a table in which a string appears only once, which saves space. For performance, there are variants for installing strings of digits and strings of known length. After installation, strings are known by their addresses and the characters are accessed only for output. For example, looking a name up in the symbol table is done by hashing on the address of the name; string comparison is unnecessary. \subsection{Symbol Tables} Fast symbol table manipulation also contributes to \verb|lcc|'s speed. It took several versions of the symbol table module to arrive at the current one, however. Symbols are represented with structures defined by \begin{verbatim} struct symbol { char *name; /* symbol name */ int scope; /* scope level */ ... }; \end{verbatim} The symbol table module uses hash tables for symbol tables; the initial version used a single table for all scopes, i.e., \begin{verbatim} struct entry { struct symbol sym; /* this symbol */ struct entry *link; /* next entry on hash chain */ }; struct table { struct entry *buckets[HASHSIZE]; /* hash buckets */ }; \end{verbatim} Symbols are wrapped in \verb|entry| structures to keep the linkage information private to the symbol table module. Scope entry required no code. Each new symbol was added to the head of its hash chain and thereby hid symbols with the same names, which appeared further down on the same chains. At scope exit, however, entries at the current scope level, indicated by the value of \verb|level|, were removed from the table \verb|*tp| by the code \begin{verbatim} for (i = 0; i < HASHSIZE; i++) { struct entry *p = tp->buckets[i]; while (p && p->sym.scope == level) p = p->link; tp->buckets[i] = p; } \end{verbatim} Measurements revealed that this code accounted for over 5~percent of \verb|lcc|'s execution time on typical input. This code scanned the hash buckets even for scopes that introduce no new symbols, which are common in C. The second version of the symbol table module used a separate hash table for each scope level: \begin{verbatim} struct table { struct table *previous; /* table at lower scope */ struct entry *buckets[HASHSIZE]; /* hash buckets */ }; \end{verbatim} Searching for a symbol took the same number of comparisons, but also required a traversal of the list of separate tables, e.g., \begin{verbatim} struct symbol *lookup(char *name, struct table *tp) { struct entry *p; unsigned h = ((unsigned)name)&(HASHSIZE-1); do for (p = tp->buckets[h]; p; p = p->link) if (name == p->sym.name) return &p->sym; while (tp = tp->previous); return 0; } \end{verbatim} Notice that symbol names are compared by simply comparing addresses as explained in the previous section. Despite the conventional wisdom about hashing functions~\cite{sedgewick88}, using a power of two for \verb|HASHSIZE| gave better performance; using a prime instead and modulus in place of masking slowed \verb|lcc|. This variation reduced the scope exit code to \begin{verbatim} tp = tp->previous \end{verbatim} for table \verb|*tp|. Unfortunately, scope entry then required allocation and initialization of a table: \begin{verbatim} struct table *new = (struct table *)alloc(sizeof *new); new->previous = tp; for (i = 0; i < HASHSIZE; i++) new->buckets[i] = 0; tp = new; \end{verbatim} So, the time wasted at scope exit in the first version was traded for a similar waste at scope entry in the second version. The symbol table module in actual use avoids this waste by lazy allocation and initialization of tables. Tables include their associated scope level: \begin{verbatim} struct table { int level; /* scope level for this table */ struct table *previous; /* table at lower scope */ struct entry *buckets[HASHSIZE]; /* hash buckets */ }; \end{verbatim} New tables are allocated and initialized only when a symbol is installed: \begin{verbatim} struct symbol *install(char *name, struct table **tpp) { unsigned h = ((unsigned)name)&(HASHSIZE-1); struct table *tp = *tpp; struct entry *p = (struct entry *)alloc(sizeof *p); if (tp->level < level) { int i; struct table *new = (struct table *)alloc(sizeof *new); new->previous = tp; new->level = level; for (i = 0; i < HASHSIZE; i++) new->buckets[i] = 0; *tpp = tp = new; } p->sym.name = name; p->sym.scope = tp->level; ... p->link = tp->buckets[h]; tp->buckets[h] = p; return &p->sym; } \end{verbatim} Since few scopes in C, which are delimited by compound statements, declare new symbols, the lazy allocation code above is rarely executed and entry to most scopes is nearly free. The scope exit code must check before discarding a table, but remains simple: \begin{verbatim} if (tp->level == level) tp = tp->previous; \end{verbatim} This design also simplifies access to separate tables. For example, the table that holds globals is at the end of the list of identifier tables; by making it the value of \verb|globals|, symbols can be installed into it directly. In the initial implementation, a global declared at a nested scope had to be inserted in the middle of its hash chain. \subsection{Storage Management} Allocation and deallocation in early versions of \verb|lcc| accounted for a significant portion of the total execution time. Replacing the naive use of \verb|malloc| and \verb|free| reduced total execution time by about 8--10~percent. As detailed in Reference~\cite{hanson90}, allocation is based on the lifetime of the objects allocated, and all objects with the same lifetime are freed at once. This approach to storage management simplified \verb|lcc|'s code. Initially, each object type had explicit deallocation code, perhaps replicated at several points. Some of this code was intricate, e.g., involving complex loops or recursive data structure traversals. Allocation incurred an obligation to provide the necessary deallocation code, so there was a tendency to use algorithms that avoided allocation, perhaps at the expense of time, complexity, and flexibility. And it was easy to forget deallocation, resulting in storage leaks. The current scheme eliminated nearly all explicit deallocation code, which simplified the compiler and eliminated storage leaks. More importantly, it encouraged the use of simple applicative algorithms, e.g., in rewriting trees. The replacements cost space, but not time, since allocation and deallocation are nearly free. Besides contributing to fast compilation, the other visible benefit of this approach is that \verb|lcc| imposes few arbitrary limits on its input; e.g., it permits any number of cases in switch statements, any number of parameters and locals, block nesting to any depth, expressions of arbitrary complexity, initializations of arbitrary size, etc. These quantities are limited only by the memory available. \section{Optimization} \verb|lcc| is not properly called an ``optimizing'' compiler because it does no global optimization, {\em per se}. Its front end does, however, perform some simple, target-independent transformations that help its back ends generate good local code. The front end eliminates local common subexpressions, folds constant expressions, and makes numerous simple transformations that improve the quality of local code~\cite{hanson83}. Many of these improvements are simple tree transformations that lead to better addressing code. The front end lays out loops so as to reduce the number of unconstructive branches~\cite{baskett78}, e.g., the code for \begin{flushleft} \tt for ($e_1$; $e_2$; $e_3$) $S$ \end{flushleft} has the form \begin{flushleft}\tt \begin{tabular}{ll} & $e_1$ \\ & goto L1 \\ L2: & $S$ \\ L3: & $e_3$ \\ L1: & if ($e_2$) goto L2 \\ \end{tabular} \end{flushleft} The \verb|goto L1| is omitted if $e_2$ is initially non-zero. In addition, the front end eliminates branch chains and dead branches. The selection code for switch statements is generated entirely by the front end. It generates a binary search of dense branch tables~\cite{bernstein85}, where the density is the percentage of non-default branch table entries. For example, with the default density of 0.5, a switch statement with the case values 1, 2, 6--8, 1001--1004, and 2001--2002 has the following VAX selection code. Register \verb|r4| holds the value of the switch expression, \verb|L3|--\verb|15| label the statements for the case values above, and \verb|L1| is the default label. \begin{verbatim} cmpl r4,$1001 jlss L17 cmpl r4,$1004 jgtr L16 movl _18-4004[r4],r5 jmp (r5) _18: .long L8 .long L9 .long L10 .long L11 L17: cmpl r4,$1 jlss L1 cmpl r4,$8 jgtr L1 movl _21-4[r4],r5 jmp (r5) _21: .long L3 .long L4 .long L1 .long L1 .long L1 .long L5 .long L6 .long L7 L16: cmpl r4,$2001 jlss L1 cmpl r4,$2004 jgtr L1 movl _24-8004[r4],r5 jmp (r5) _24: .long L12 .long L13 .long L14 .long L15 \end{verbatim} The density can be changed by a command-line option; e.g., \verb|-d0| yields a single branch table for each switch statement, and \verb|-d1| requires that all branch tables be fully populated. Finally, the front end simulates register declarations for all scalar parameters and locals that are referenced at least 3 times and do not have their addresses taken explicitly. Locals are announced to the back ends with explicitly declared \verb|register| locals followed by the remaining locals in the order of decreasing frequency of use. Each top-level occurrence of an identifier counts as 1 reference. Occurrences in a loop, either of the then/else arms of an if statement, or a case in a switch statement each count, respectively, as 10, $1/2$, or $1/10$ references. These values are adjusted to account for nested control structures. The next section describes how these estimated counts may be replaced with counts from an actual profile. This scheme simplifies register assignment in the back ends, and explicit \verb|register| declarations are rarely necessary. For example, \begin{verbatim} strcpy(char *s1, char *s2) { while (*s1++ = *s2++); } \end{verbatim} yields the VAX code \begin{verbatim} _strcpy:.word 0x0 movl 4(ap),r4 movl 8(ap),r5 L26: movb (r5)+,(r4)+ jneq L26 ret \end{verbatim} \section{Features} \verb|lcc| provides a few noteworthy features that help users develop, debug, and profile ANSI~C programs. For example, an option causes \verb|lcc| to print ANSI-style C declarations for all defined globals and functions. For instance, the code (adapted from Section~6.2 of Reference~\cite{kernighan:ritchie:88}) \begin{verbatim} typedef struct point { int x,y; } point; typedef struct rect { point pt1, pt2; } rect; point addpoint(p1, p2) point p1, p2; { p1.x += p2.x; p1.y += p2.y; return p1; } int ptinrect(p, r) point p; rect r; { return p.x >= r.pt1.x && p.x < r.pt2.x && p.y >= r.pt1.y && p.y < r.pt2.y; } \end{verbatim} generates the declarations \begin{verbatim} extern point addpoint(point, point); extern int ptinrect(point, rect); \end{verbatim} Editing such output can simplify conversion to ANSI~C. Another option causes \verb|lcc| to issue warnings for declarations and casts of function types without prototypes. These include pointers to functions, which are easy to overlook when updating pre-ANSI code. For example, it is likely that \verb|char *(_alloc)()| should be updated to be \verb|char *(_alloc)(size_t)|. \subsection{Debugging} \verb|lcc| supports the standard debugger symbol tables on VAXes and Suns. It also has two options of its own to assist in program debugging. Dereferencing zero pointers is a frequent C programming error. On some systems, execution continues until the consequences cause a fault somewhere unrelated to the actual point of error. To help catch such errors, an option causes \verb|lcc| to generate code to test for dereferencing zero pointers. If a zero pointer is detected, the offending file name and line number are reported on the standard error, e.g., \begin{verbatim} null pointer dereferenced @foo.c:36 \end{verbatim} and the program terminates by calling the standard library function \verb|abort|. Some languages provide built-in facilities for tracing function calls and returns~\cite{griswold:griswold:90}. An option instructs \verb|lcc| to generate calls to \verb|printf| (or a user-specified equivalent) just after entry to each function and just before each return. The entry code prints the arguments and the return code prints the value returned. For example, calling the functions shown above would elicit messages like \begin{verbatim} addpoint#2(p1=(point){x=0,y=0},p2=(point){x=10,y=10}) called addpoint#2 returned (point){x=10,y=10} ... ptinrect#1(p=(point){x=-1,y=-1}, r=(rect){pt1=(point){x=10,y=10},pt2=(point){x=310,y=310}}) called ptinrect#1 returned 0 \end{verbatim} (Long lines have been folded to fit this page.) As illustrated by this output, the messages show the full details of the arguments, including structure contents. The numbers that follow function names, e.g., \verb|#2|, are activation numbers and can help locate a specific call and its return. These debugging options are implemented entirely in the front end and thus are available on all of \verb|lcc|'s targets. \subsection{Profiling} \verb|lcc| supports \verb|prof|-style (viz.~\cite[\verb|prof| command]{bsd84}) and \verb|gprof|-style~\cite{graham:kessler:mckusick:83} execution profiling on VAXes and Suns. These profilers sample the location counter periodically to obtain an estimate of the percentage of total execution time spent in each function, and they report the number of calls to each function. Heeding long-standing advice~\cite{knuth71,sites78}, \verb|lcc| also supports frequency-based profiling. An option causes \verb|lcc| to emit counters that record the number of times each {\em expression} is executed, and the values of these counters are written to the file \verb|prof.out| when the program terminates. A companion program, \verb|bprint|, reads \verb|prof.out| and prints the source code annotated with execution counts, e.g., \begin{verbatim} ... 4 main() 5 <1>{ ... 12 <1>queens(0); 13 return <1>0; 14 <1>} 15 16 queens(c) 17 <1965>{ 18 int r; 19 20 for (<1965>r = 0; <15720>r < 8; <15720>r++) 21 if (<15720>rows[r] && <5508>up[r-c+7] && <3420>down[r+c]){ 22 <2056>rows[r] = up[r-c+7] = down[r+c] = 0; 23 <2056>x[c] = r; 24 if (<2056>c == 7) 25 <92>print(); 26 else 27 <1964>queens(c + 1); 28 <2056>rows[r] = up[r-c+7] = down[r+c] = 1; 29 } 30 <1965>} ... \end{verbatim} Execution counts are enclosed in angle brackets. The counts on the outermost braces for \verb|queens| give the number of calls. Line 21 shows the benefit of associating a count with each expression instead of each line; the counts reveal that \verb|up[r-c+7]| was tested only slightly more than one-third of the number of times the if statement was executed. Conditional expressions are annotated similarly. Users sometimes report an ``off-by-one'' bug when they see that \verb|r < 8| in line 20 was executed the same number of times as \verb|r++|. These counts are a consequence of the way \verb|lcc| lays out for loops and eliminates the test before the first iteration, as described above. Data in \verb|prof.out| accumulates, so it is possible to execute a program repeatedly and then have \verb|bprint| display the cumulative frequencies. This method is particularly useful for developing test data that exercises all parts of a program: \verb|<0>| highlights untested code. Another option causes \verb|lcc| to read \verb|prof.out| and use the counts therein to compute the frequency of use of each identifier instead of using the estimates described in the previous section. Doing so may reduce the number of uses for identifiers that appear in loops that rarely executed more than once, and increase the number of uses for those that appear in then/else arms that are executed most of the time. Complex preprocessor macros can obscure \verb|bprint|'s presentation. It necessarily uses post-expansion source coordinates to annotate pre-expansion source files. Profiling code also records the number of calls made from each call site, which can be used to reconstruct the dynamic call graph. \verb|bprint| prints a line for each edge, e.g., \begin{verbatim} 1 queens from main in 8q.c:12.8 1964 queens from queens in 8q.c:27.11 92 print from queens in 8q.c:25.10 \end{verbatim} This output shows that all but one of the calls to \verb|queens| was from the call at character 11 in line 27. This kind of data is particularly helpful in identifying hot spots that are caused by inappropriate calls to a function instead of inefficiencies within the function itself. Such data can also help identify functions that might profitably be replaced with two functions so that one can handle the common case more efficiently~\cite[Sec.~5.3]{bentley82}. Expression execution frequency profiling is implemented entirely by the front end. The only machine dependency is the name of the ultimate termination function in the revised \verb|exit| function that writes \verb|prof.out| at program termination. The implementation is a machine-independent variation of the method described in Reference~\cite{weinberger84}. The front end generates an array of counters for each file and starts each expression with code to increment the appropriate counter. In also builds a parallel array that holds the source coordinates corresponding to each counter. At the entry point of each function, the front end generates the equivalent of \begin{verbatim} if (!_yylink.link) { extern struct _bbdata *_bblist; _yylink.link = _bblist; _bblist = &yylink; } _prologue(&callee); \end{verbatim} A \verb|_bbdata| structure is generated for each file: \begin{verbatim} static struct _bbdata { struct _bbdata *link; unsigned npoints; unsigned *counts; unsigned *coords; struct func *funcs; } _yylink; \end{verbatim} The \verb|counts| and \verb|coords| fields point the arrays mentioned above, which each have \verb|npoints| entries. The entry point code uses the \verb|link| field to add each file's \verb|_bbdata| structure to the list headed by \verb|_bblist|, which the revised \verb|exit| function walks to emit \verb|prof.out|. \verb|_prologue| accumulates the dynamic call graph. It is passed one of the \verb|func| structures --- one for each function --- that appear on the list emanating from \verb|_yylink.funcs|: \begin{verbatim} struct func { struct func *link; struct caller { struct caller *link; struct callsite *caller; unsigned count; } *callers; char *name; unsigned coord; }; \end{verbatim} The \verb|name| and \verb|coord| fields give the function's name and beginning source coordinate, respectively. \verb|callers| points to a list of \verb|caller| structures, one for each call site. Each \verb|caller| structure records the number of calls from the caller's \verb|callsite|: \begin{verbatim} struct callsite { char *file; char *name; unsigned coord; }; \end{verbatim} \verb|caller| structures are allocated at execution time and point to \verb|callsite|s, which are generated by the front end at compile time. Just before each call, the front end generates an assignment of a pointer to a \verb|callsite| structure to the global variable \verb|_caller|. \verb|_prologue| uses \verb|_caller| to record an edge in the dynamic call graph. If a record of the caller already exists, its count is simply incremented. Otherwise, a \verb|caller| structure is allocated and prefixed to the callee's list of callers. \begin{verbatim} _prologue(struct func *callee) { static struct caller callers[4096]; static int next; struct caller *p; for (p = callee->callers; p; p = p->link) if (p->caller == _caller) { p->count++; break; } if (!p && next < sizeof callers/sizeof callers[0]) { p = &callers[next++]; p->caller = _caller; p->count = 1; p->link = callee->callers; callee->callers = p; } _caller = 0; } \end{verbatim} Profiling can be restricted to only those files of interest. The counts printed by \verb|bprint| will be correct, but some edges may be omitted from the call graph. For example, if \verb|f| calls \verb|g| calls \verb|h| and \verb|f| and \verb|h| are compiled with profiling, but \verb|g| is not, \verb|bprint| will report that \verb|f| called \verb|h|. The total number of calls to each function is correct, however. \section{Performance} \verb|lcc| emits local code that is comparable to that emitted by the generally available alternatives. Table~\ref{spec:execution} summarizes the results of compiling and executing the C programs in the SPEC benchmarks~\cite{spec89} with three compilers on the four machines listed above. Configuration details are listed with each machine. \verb|cc| and \verb|gcc| denote, respectively, the manufacturer's C compiler and the GNU~C compiler from the Free Software Foundation. The times are elapsed time in seconds and are the lowest elapsed times over several runs on lightly loaded machines. All reported runs achieved at least 97 percent utilization (i.e., the ratio of times $({\it user} + {\it system})/{\it elapsed} \ge 0.97$). The entries with \verb|-O| indicate compilation with the ``default'' optimization, which often includes some global optimizations. \verb|lcc| performs no global optimizations. The \verb|gcc| and \verb|gcc -O| figures for {\tt gcc1.35} on the MIPS are missing because this benchmark did not execute correctly when compiled with \verb|gcc|. \newcommand{\0}{\hspace{0.5em}} \begin{table} \begin{center} \begin{tabular}{lcccc} &\multicolumn{4}{c}{\it benchmark}\\ \it compiler & 1. \tt gcc1.35& 8. \tt espresso& 22. \tt li & 23. \tt eqntott \\ \hline %\makebox[0pt][l]{VAX: VAX 8650 w/36MB running 4.3BSD UNIX} \\% % megaron % v1.3 %\tt lcc & 345 & 588 & 1315 & 296 \\ %\tt cc & 350 & 504 & 1471 & 334 \\ %\tt gcc & 304 & 501 & 1316 & 327 \\ %v1.36 %\tt cc -O & 320 & 525 & 1486 & 325 \\ %\tt gcc -O & 283 & 466 & 1272 & 201 \\[2ex] \\[-.5ex] \makebox[0pt][l]{VAX: \small MicroVAX II w/16MB running Ultrix 3.1} \\% v1.3; ffserver \tt lcc & 1734 & 2708 & 7015 & 3532 \\ \tt cc & 1824 & 2782 & 7765 & 3569 \\ \tt gcc & 1439 & 2757 & 7353 & 3263 \\ %v1.36 \tt cc -O & 1661 & 2653 & 7086 & 3757 \\ \tt gcc -O & 1274 & 2291 & 6397 & 1131 \\[2ex] \makebox[0pt][l]{68020: \small Sun 3/60 w/24MB running SunOS 4.0.3} \\% v1.3; hemlock \tt lcc & 544 & 1070 & 2591 & 567 \\ \tt cc & 514 & 1005 & 3308 & 619 \\ \tt gcc & 426 & 1048 & 2498 & 591 \\ %v1.35 \tt cc -O & 428 &\0882 & 2237 & 571 \\ \tt gcc -O & 337 &\0834 & 1951 & 326 \\[2ex] \makebox[0pt][l]{MIPS: \small IRIS 4D/220GTX w/32MB running IRIX 3.3.1} \\% v1.3; oyoy \tt lcc & 116 & 150 & 352 & 111 \\ \tt cc & 107 & 153 & 338 & 100 \\ \tt gcc & & 188 & 502 & 132 \\ %v1.36 \tt cc -O &\092 & 130 & 299 &\070 \\ \tt gcc -O & & 145 & 411 & 112 \\[2ex] %\makebox[0pt][l]{SPARC:\hspace{1em}}% phoenix: Sun 4/490 w/128MB, SunOS 4.1 %\tt lcc & 106 & 218 & 436 & 129 \\ %\tt cc &\099 & 189 & 589 & 133 \\ %\tt gcc &\094 & 205 & 589 & 131 \\ %v1.36 %\tt cc -O &\078 & 154 & 351 &\089 \\ %\tt gcc -O &\064 & 153 & 384 &\092 \\ % \makebox[0pt][l]{SPARC: \small Sun 4/260 w/32MB running SunOS 4.0.3}\\% v1.3; python \tt lcc & 196 & 370 &\0790 & 209 \\ \tt cc & 203 & 381 & 1094 & 275 \\ \tt gcc & 186 & 411 & 1139 & 256 \\ %v1.34 \tt cc -O & 150 & 296 &\0707 & 183 \\ \tt gcc -O & 127 & 309 &\0788 & 179 \\ \end{tabular} \end{center} \caption{Execution Time for C SPEC Benchmarks in Seconds.\label{spec:execution}} \end{table} \verb|lcc| is faster than many (but not all~\cite{thompson90}) other C compilers. Table~\ref{spec:compilation} parallels Table~\ref{spec:execution}, but shows compilation time instead of execution time. Except for the MIPS, the times are for running only the compiler proper; preprocessing, assembly, and linking time are not included. Two times are given for the MIPS because the manufacturer's \verb|cc| front end consists of two programs; the first translates C to ``u-code'' and the second generates object code. Generating assembly language costs {\em more} than generating object code, so Table~\ref{spec:compilation} gives both times for all compilers. The last row in Table~\ref{spec:compilation} lists the number of non-blank lines and the total number of lines in each benchmark {\em after} preprocessing. \begin{table} \begin{center} \begin{tabular}{lcccc} &\multicolumn{4}{c}{\it benchmark}\\ \it compiler & 1. \tt gcc1.35& 8. \tt espresso& 22. \tt li & 23. \tt eqntott \\ \hline %\makebox[0pt][r]{VAX:\hspace{1em}}% megaron %\tt lcc & 189 &\059 & 31 &\09 \\ %\tt cc & 405 & 114 & 40 & 26 \\ %\tt gcc & 505 & 222 & 74 & 45 \\[-.5ex] \makebox[0pt][l]{VAX:}\\% v1.3; ffserver \tt lcc &\0792 & 237 &\069 & 36 \\ \tt cc & 1878 & 576 & 174 & 79 \\ \tt gcc & 1910 & 637 & 192 & 86 \\[2ex] \makebox[0pt][l]{68020:}\\% v1.3; hemlock \tt lcc & 302 &\090 & 28 & 15 \\ \tt cc & 507 & 168 & 52 & 29 \\ \tt gcc & 599 & 196 & 56 & 27 \\[2ex] \makebox[0pt][l]{MIPS:}\\% v1.3; oyoy \tt lcc &\097 195 &\035 \063 & 10 24 &\06 16\\ \tt cc & 318 177 & 104 \068 & 40 26 & 24 19 \\ \tt gcc & 320 391 &\088 118 & 28 42 & 13 24 \\[2ex] \makebox[0pt][l]{SPARC:}\\% v1.3; python \tt lcc & 103 &\038 & 12 & \08 \\ \tt cc & 175 &\060 & 18 & 11 \\ \tt gcc & 313 & 100 & 31 & 16 \\[2ex] line counts & 79102/250496 & 25717/58516 & 7070/22494 & 2680/6569 \\ \end{tabular} \end{center} \caption{Compilation Time for C SPEC Benchmarks in Seconds.\label{spec:compilation}} \end{table} \verb|lcc| is smaller than other compilers. Table~\ref{sizes} lists the sizes of the three compilers in kilobytes. Each entry is the sum of sizes of the program and data segments for the indicated compiler as reported by the UNIX \verb|size| command. \begin{table} \begin{center} \begin{tabular}{lcccc} \it compiler & VAX & 68020 & MIPS & SPARC \\ \hline \tt lcc & 181 & 244 & 280 & 276 \\ %v1.3 \tt cc & 256 & 306 & 616 & 402 \\ \tt gcc & 378 & 507 & 777 & 689 \\ \end{tabular} \end{center} \caption{Sizes of Compiler Executables in Kilobytes.\label{sizes}} \end{table} %\bibliography{refs,lib} \begin{thebibliography}{10} \bibitem{aho:sethi:ullman:86} A.~V. Aho, R.~Sethi, and J.~D. Ullman. \newblock {\em Compilers: Principles, Techniques, and Tools}. \newblock Addison Wesley, Reading, MA, 1986. \bibitem{ansi:Cstandard} American National Standard Institute, Inc., New York. \newblock {\em American National Standards for Information Systems, Programming Language C {ANSI} X3.159--1989}, 1990. \bibitem{baskett78} F.~Baskett. \newblock The best simple code generation technique for while, for, and do loops. \newblock {\em SIGPLAN Notices}, 13(4):31--32, Apr. 1978. \bibitem{bentley82} J.~L. Bentley. \newblock {\em Writing Efficient Programs}. \newblock Prentice Hall, Englewood Cliffs, NJ, 1982. \bibitem{bernstein85} R.~L. Bernstein. \newblock Producing good code for the case statement. \newblock {\em Software---Practice \& Experience}, 15(10):1021--1024, Oct. 1985. \bibitem{bsd84} Computer Science Division, Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA. \newblock {\em UNIX User's Manual, Reference Guide}, virtual {VAX}-11 version edition, Mar. 1984. \bibitem{fraser:sigplan89} C.~W. Fraser. \newblock A language for writing code generators. \newblock {\em Proceedings of the SIGPLAN'89 Conference on Programming Language Design and Implementation, SIGPLAN Notices}, 24(7):238--245, July 1989. \bibitem{fraser:hanson:91a} C.~W. Fraser and D.~R. Hanson. \newblock A code generation interface for {ANSI C}. \newblock {\em Software---Practice \& Experience}, 21(9):963--988, Sept. 1991. \bibitem{fraser:hanson:92} C.~W. Fraser and D.~R. Hanson. \newblock Simple register spilling in a retargetable compiler. \newblock {\em Software---Practice \& Experience}, 22(1):85--99, Jan. 1992. \bibitem{graham:kessler:mckusick:83} S.~L. Graham, P.~B. Kessler, and M.~K. McKusick. \newblock An execution profiler for modular programs. \newblock {\em Software---Practice \& Experience}, 13(8):671--685, Aug. 1983. \bibitem{griswold:griswold:90} R.~E. Griswold and M.~T. Griswold. \newblock {\em The Icon Programming Language}. \newblock Prentice Hall, Englewood Cliffs, NJ, second edition, 1990. \bibitem{hanson83} D.~R. Hanson. \newblock Simple code optimizations. \newblock {\em Software---Practice \& Experience}, 13(8):745--763, Aug. 1983. \bibitem{hanson90} D.~R. Hanson. \newblock Fast allocation and deallocation of memory based on object lifetimes. \newblock {\em Software---Practice \& Experience}, 20(1):5--12, Jan. 1990. \bibitem{kernighan:ritchie:88} B.~W. Kernighan and D.~M. Ritchie. \newblock {\em The C Programming Language}. \newblock Prentice Hall, Englewood Cliffs, NJ, second edition, 1988. \bibitem{knuth71} D.~E. Knuth. \newblock An empirical study of {FORTRAN} programs. \newblock {\em Software---Practice \& Experience}, 1(2):105--133, Apr. 1971. \bibitem{sedgewick88} R.~Sedgewick. \newblock {\em Algorithms}. \newblock Addison Wesley, Reading, MA, 1988. \bibitem{sites78} R.~L. Sites. \newblock Programming tools: Statement counts and procedure timings. \newblock {\em SIGPLAN Notices}, 13(12):98--101, Dec. 1978. \bibitem{spec89} Standards Performance Evaluation Corp. \newblock {\em {SPEC} Benchmark Suite Release 1.0}, Oct. 1989. \bibitem{thompson90} K.~Thompson. \newblock A new {C} compiler. \newblock In {\em Proceedings of the Summer 1990 UKUUG Conference}, pages 41--51, London, July 1990. \bibitem{waite86} W.~M. Waite. \newblock The cost of lexical analysis. \newblock {\em Software---Practice \& Experience}, 16(5):473--488, May 1986. \bibitem{weinberger84} P.~J. Weinberger. \newblock Cheap dynamic instruction counting. \newblock {\em Bell System Technical Journal}, 63(8):1815--1826, Oct. 1984. \end{thebibliography} \end{document}

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.