|
|
1.1 root 1: .LP
2: -
3: .ft 3
4: .ce 99
5: .sp 2i
6: .LG
7: Profiling in the Presence of Optimization and Garbage Collection
8: .sp
9: .ft 2
10: .ce 99
11: .NL
12: Andrew W. Appel*
13: Bruce F. Duba\(dg
14: David B. MacQueen\(dg
15: .sp 0.6i
16: .ce 99
17: .NL
18: November 1988
19: .sp 1i
20: .ce
21: .ft 2
22: ABSTRACT
23: .ft 1
24: .IP
25: Profiling the execution of programs can be a great help in tuning their
26: performance, and programs written in functional languages are no exception.
27: The standard techniques of call-counting and statistical (interrupt-driven)
28: execution time measurement work well, but with some modification. In
29: particular, the program counter is not the best indicator of ``current
30: function.''
31: Our profiler inserts explicit increment and assignment statements into
32: the intermediate representation, and is therefore very simple to implement
33: and completely independent of the code-generator.
34: .LP
35: .sp 1i
36: .nr PS 8
37: .nr VS 10
38: .LP
39: * Supported in part by NSF Grant CCR-8806121 and by a Digital Equipment Corp.
40: Faculty Incentive Grant.
41: .LP
42: \(dg AT&T Bell Laboratories, Murray Hill, NJ.
43: .nr PS 10
44: .nr VS 16
45: .LP
46: .bp
47: .NH
48: Execution profiling
49: .LP
50: A large program usually consists of many small functions.
51: When such a program is to be tuned for efficiency, it is necessary to
52: identify which of those functions are taking the bulk of the execution time.
53: Then the commonly-used functions can be made more efficient, or called less
54: often, or both. By using a theoretical analysis of the algorithms used
55: in a program, such functions can be identified; but a complete
56: theoretical analysis is complex and
57: impractical for large programs.
58:
59: An execution profiler provides an empirical measurement of the time spent
60: in each function. A widely-used Unix tool, \fBprof\fP [Unix],
61: provides a count of how many times each function is called, and how many
62: seconds were spent in each function. This information is very useful
63: in identifying which functions are in need of improvement, after which
64: a theoretical analysis of just those functions might be carried out,
65: a much less forbidding endeavor than analyzing the whole program.
66:
67: \fBProf\fP gathers
68: call-count information
69: by having the compiler insert at the beginning of each function
70: an instruction that increments a call-count variable associated with
71: the function. An approximation to the amount
72: of total time spent in the function is gathered by the use of a timer
73: interrupt: every 1/60th of a second, the operating system notes
74: in a ``histogram'' array the value of the program counter.
75: (This is an ancient technique [Johnston70].)
76: Then, at the end of program execution, \fBprof\fP estimates
77: the amount of time spent in each
78: function by summing the values in the histogram array
79: corresponding to program counter samples between the beginning and end of the
80: machine code for that function.
81: The interrupt-driven sampling method has a much lower overhead than
82: querying a clock on each entry to and exit from a function.
83:
84: A more elaborate profiling tool, \fBgprof\fP [Graham82],
85: provides even more information.
86: When one primitive function (e.g. a
87: table-lookup routine) is used in many places, it is useful to know,
88: not only the total time for the execution of the primitive, but also, how
89: much time to ``charge'' the calling functions. With \fBgprof\fP this is
90: approximated by keeping a count of the number of
91: times each call-site is used, and on that basis apportioning the
92: average execution time
93: of the called function to the functions that call it.
94:
95: These approaches to execution-time measurement and apportionment pose certain
96: problems for optimizing compilers and for functional languages:
97: .IP 1.
98: The machine-code for a function is not necessarily all contiguous.
99: A function may be turned into several pieces of code, with portions of
100: the code for other functions interspersed. This problem could certainly
101: be solved by elaborate bookkeeping in the optimizer and code generator,
102: but we wanted to avoid that complexity.
103: .IP 2.
104: An optimizer can expand functions in-line in other functions.
105: The program-counter method will charge the calling function
106: instead of the called function,
107: even though it might be desirable for in-line expansion to be made
108: semantically invisible.
109: .IP 3.
110: The histogram array in \fBprof\fP
111: must be proportional in size to the address range spanned by pieces of
112: code for executable functions. Our runtime
113: system intersperses code and data
114: throughout memory; even worse, it periodically garbage collects, moving
115: code and data from place to place. This problem could have been solved
116: by elaborate bookkeeping in the runtime system, which we also wanted to avoid.
117: .LP
118: We had to deal with these problems in the course of implementing a
119: profiler for an optimizing compiler for the functional language
120: Standard ML [Appel87]. The approach we used is described in the next
121: section.
122: .NH
123: Intermediate Representation of call-counting and current-function
124: .LP
125: For execution-time estimation we use a timer interrupt, as does
126: \fBprof\fP, to increment a histogram entry. However, we don't use the
127: program counter to calculate which histogram entry to increment.
128: Instead, we maintain a ``pointer-to-current-function-entry'' in a
129: global variable called \fBcurrent\fP that is accessible to the
130: timer-interrupt handler. Each function has associated with it two
131: auxiliary variables: a call-count and an interrupt-count. On entry to
132: a function, it increments the call-count and assigns the address of
133: the interrupt-count variable into the global \fBcurrent\fP variable.
134: Then, when a timer interrupt occurs, the interrupt handler just increments the
135: variable that \fBcurrent\fP points to.
136:
137: When a function returns \(em either normally or via an exception \(em
138: \fBcurrent\fP must be set back to the interrupt-count variable of the
139: function that it is returning to. This resetting could be done either
140: by the calling function (after the called function has returned), or by
141: the called function before exit. For several reasons, it is better done by
142: the calling function. If the called function does the reset, a stack
143: of current-function pointers is required; this is expensive to maintain.
144: A stack of current-pointers would also greatly complicate the treatment
145: of exception-handlers; with the caller-reset method, the exception-handler
146: justs sets the \fBcurrent\fP to point at the appropriate counter variable
147: on entry.
148: On recursive calls, \fBcurrent\fP need not be reset (as the calling
149: and called function are the same), but only the calling function knows
150: which calls are recursive. And finally, tail-calls can be optimized
151: if the caller resets \fBcurrent\fP.
152:
153: A tail-call is one that is not followed by any
154: executable code before the function returns. After a tail-call, the
155: function will immediately return and therefore \fBcurrent\fP will
156: immediately be reset. Therefore, it is not necessary for the calling
157: function to reset \fBcurrent\fP after a tail-call.
158: This is a useful optimization, and it is particularly important
159: when used with a compiler that optimizes tail-calls into jumps; if the
160: current pointer had to be reset after the tail-call, it would no
161: longer be a tail-call and performance would suffer dramatically.
162: Fortunately, it is easy to identify tail-calls statically as the
163: profiling instructions are being inserted.
164:
165: We insert the profiling instructions as ordinary assignment statements
166: in the intermediate representation. In almost any compiler's
167: intermediate representation it is easy to represent the operations
168: of fetching, adding one, and storing, for the call-count increment operation;
169: and storing, for the assignment to the \fBcurrent\fP variable.
170:
171: Functional programming languages introduce another problem for the
172: design of profilers: what to do with anonymous, first-class functions.
173: The simplest choice is to do nothing; collect no call-counts and let
174: the time be charged to the caller of the unnamed function. The main
175: disadvantage of this solution, besides not having the call-counts, is
176: that there is no convenient way to find the code that contributes to
177: the cost of a profiled function that calls anonymous functions.
178:
179: Probably the most general solution is to make up names for the unnamed
180: functions (for example, an unnamed function statically enclosed in function
181: \fIf\fP might be called \fIf.anon\fP).
182: If anonymous functions are given names they can be treated
183: just as any other function; call counts and execution time will be
184: reported. Of course, the user will need to associate the new names
185: with the correct function, but in practice this is rarely a problem.
186: .NH
187: An example
188: .LP
189: To illustrate the technique, we present a simple example (figure 1). The ML function
190: \f(CWsubset\fP takes a predicate function as an argument, and returns
191: a function that maps lists to lists; the output list will be that sublist
192: of the input list containing just those elements that satisfy the predicate.
193: The user's program is displayed in typewriter font; the compiler puts some
194: scaffolding around it (indicated in italics) to make a record
195: containing all the functions declared by the user.
196: .KF
197: .DS
198: .ft CW
199: \fIlet\fP fun subset pred =
200: let fun f nil = nil
201: | f (a::r) = if pred a then a::f(r) else f(r)
202: in f
203: end
204:
205: fun isPrime x =
206: let fun test i = i>=x orelse (x mod i <> 0 andalso test(i+1))
207: in test 2
208: end
209:
210: val primes = subset isPrime
211: \fI in (subset, isPrime, primes)
212: end\fP
213: .ft R
214: .DE
215: .DS C
216: Figure 1.
217: .DE
218: .KE
219: If this code is compiled with profiling enabled, the compiler inserts
220: the call-counting and current-function instructions into the intermediate
221: representation. Here, we display the effects as if written in the source language
222: (figure 2).
223: .KF
224: .DS
225: .ft CW
226: \fIlet val subset.CC = ref 0 and subset.IC = ref 0
227: and subset.f.CC = ref 0 and subset.f.IC = ref 0
228: and isPrime.CC = ref 0 and isPrime.IC = ref 0
229: and isPrime.test.CC = ref 0 and isPrime.test.IC = ref 0
230: \fP
231: fun subset pred =
232: \fI(subset.CC := !subset.CC + 1;\fP
233: \fIcurrent := subset.IC;\fP
234: let fun f x =
235: \fI(subset.f.CC := !subset.f.CC + 1;\fP
236: \fIcurrent := subset.f.IC;\fP
237: case x of
238: nil => nil
239: | a::r => let val pa = pred a
240: in \fIcurrent := subset.f.IC;\fP
241: if pa then a :: f(r) else f(r)
242: end
243: in f
244: end
245:
246: fun isPrime x = \fI(isPrime.CC := !isPrime.CC + 1;\fP
247: \fIcurrent := isPrime.IC;\fP
248: . . . )
249:
250: val primes = subset isPrime
251: \fI
252: in ((subset, isPrime, primes),
253: ((subset.CC, subset.IC, "subset"),
254: (subset.f.CC, subset.f.IC, "subset.f"),
255: (isPrime.test.CC, isPrime.test.IC, "isPrime.test"),
256: (isPrime.CC, isPrime.IC, "isPrime")))
257: end\fP
258: .ft R
259: .DE
260: .DS C
261: Figure 2.
262: .DE
263: .KE
264:
265: For each function, two variables are introduced: a call-count and an interrupt-count.
266: On entry to a function, the call-count is incremented, and the global variable
267: \fBcurrent\fP is set to point to the interrupt-count. On re-entry to a function
268: after a subroutine call, \fBcurrent\fP is reset to the function's
269: interrupt-count variable. However, this is not necessary after recursive calls
270: and tail calls, e.g. the calls to \f(CWf\fP.
271:
272: The initial \fIlet\fP-bindings create all the count variables, and the
273: last four lines produce, instead of just a record containing the user's
274: declared objects, a pair of records: the user's declared objects, and a
275: list of records containing profiling variables, each with an identifying
276: string constant. These string constants will be embedded in the executable
277: code for this module, and will enable the call-count variables to be self-identifying.
278: Our runtime system maintains a global list of these
279: 3-element records; when it is time to print an execution profile, they are
280: sorted in decreasing order of interrupt-count.
281:
282: .KS
283: Our output looks like the output of \fBprof\fP:
284: .DS
285: .ft CW
286: %time cumsecs #call ms/call name
287: 90.4 3.52 78189 .045 isPrime.test
288: 8.4 3.85 1000 .330 isPrime
289: .7 3.88 0 (unprofiled)
290: .2 3.89 1001 .009 subset.f
291: .0 3.89 1001 .000 natlist
292: .0 3.89 1 .000 subset
293: .ft R
294: .DE
295: .KE
296:
297: Now, armed with this information, a programmer
298: might decide that it is worthwhile
299: re-writing the \f(CWisPrime\fP function
300: to make it as efficient as possible. But at a certain point
301: the programmer will want to know what functions are calling \f(CWisPrime\fP
302: so he can make them call it less often. By re-compiling with
303: \f(CWisPrime\fP unprofiled, any time spent in \f(CWisPrime\fP will
304: now be charged to the function that called it. This is because
305: \f(CWisPrime\fP will not change the \fBcurrent\fP variable, so that the
306: timer-interrupt will increment the count for the function that last
307: set \fBcurrent\fP \(em and this will be the one that called \fPisPrime\fP.
308: The profiling system won't do this automatically, but by comparing two
309: different execution profiles, one with \f(CWisPrime\fP compiled with
310: profile instructions and one with \f(CWisPrime\fP unprofiled, an accurate estimate
311: can be made of who is calling it.
312: .NH
313: Advantages of our current-function method
314: .LP
315: Since we use ordinary intermediate-representation operators
316: for profiling,
317: the optimizer and code-generator ``believe'' that
318: profiling operations are part of the program.
319: Since an optimizer must not modify the semantics of the program,
320: the semantics of profiling will not be modified either.
321: Therefore, if one function is copied and inserted in-line into another,
322: the call-count and current-function instructions will be copied and
323: inserted at the right place. Other optimizations that break functions
324: into several disjoint pieces of code will leave the profiling
325: instructions in the appropriate places.
326:
327: Furthermore, the result is that the implementation of the profiler
328: is completely independent of the code generator. We have four different
329: code generators for our compiler (two different algorithms each for the
330: Vax and the Motorola 68020), and not a line of any of them was modified
331: for the installation of the profiler.
332:
333: By compiling some functions unprofiled, as described in the previous section,
334: we can find out what callers are responsible for most of their execution time.
335: This kind of trick serves much the same purpose that the more elaborate
336: program \fBgprof\fP does; and it's a trick that wouldn't work with
337: a program-counter histogram.
338: Furthermore, our method is more accurate than \fBgprof\fP. Suppose
339: functions \fIf\fP and \fIg\fP both call a function \fIisPrime\fP, but
340: \fIf\fP consistently makes expensive calls (that take a long time) while
341: \fIg\fP makes cheap ones. \fBGprof\fP allocates the total time spent in
342: \fIisPrime\fP on the basis of call counts from \fIf\fP and \fIg\fP;
343: this will miss the fact that \fIf\fP is responsible for most of the cost.
344: In this example, when profiling for \fIisPrime\fP is turned off, \fIf\fP
345: and \fIg\fP will be charged for the actual time spent in \fIisPrime\fP
346: on their behalf. (On the other hand, \fBGprof\fP will give an accurate
347: breakdown of call-site counts that our method does not provide.)
348:
349: If a profiled function calls an unprofiled function, then during the
350: execution of the called function, all timer interrupts will be charged
351: to the caller (since \fBcurrent\fP still points to the caller's
352: variable). This is often desirable, as described above. But if an
353: unprofiled function calls a profiled function, then upon return to the
354: unprofiled function the \fBcurrent\fP pointer won't be reset, and
355: interrupts will continue to be charged to the called function after it
356: has returned. This is undesirable, and should be prevented by the
357: compiler. In a language with first-class functions, it is difficult
358: to prevent profiled functions from being passed as arguments to
359: unprofiled functions that might then call them. In practice, this has
360: not proved to be a problem, probably because unprofiled functionals
361: are typically simple primitives like \fIapp\fP and \fImap\fP, which do
362: little intrinsic computation.
363: .NH
364: Overhead measurements
365: .LP
366: We ran the same program several times with various of our profiling
367: features enabled; this gives a reasonably accurate measurement of profiling
368: overhead:
369: .KF
370: .TS
371: tab(|) box center;
372: l c c
373: l n n.
374: |Time|%Overhead|LenL|LenR|Comp
375: _
376: User code|2801 sec|
377: _
378: Call counts|568|20.3%
379: Setting current function|286|10.2
380: Interrupts|47|1.7
381: _
382: Total Overhead|901|32.2%
383: .TE
384: .KE
385: The total overhead of 32% is not prohibitively expensive. Our code generator
386: takes three instructions to increment a call-count (fetch, add, store);
387: a better instruction-selector could probably reduce this overhead to 8%, and the
388: total overhead to 20%.
389:
390: .KF
391: There is also an implementation overhead; it turned out to be fairly simple
392: to get this profiler running.
393: .TS
394: tab(|) center;
395: l n.
396: Insertion of profiling instructions|49 lines
397: Interrupt handling|32
398: Global database|16
399: Report generation|72
400: _
401: Total|169 lines
402: .TE
403: In contrast, this paper is about 500 lines long.
404: .KE
405: .NH
406: Conclusion
407: .LP
408: Traditional approaches to profiling run into problems when we attempt to apply
409: them to functional languages where code may be moved around by garbage collection,
410: and the task is further complicated when an optimizing compiler freely rearranges
411: the code. The basic difficulty is that the mapping between the current pc and the
412: currently executing function is difficult to maintain.
413:
414: We have found a simple way around this difficulty, which consists of
415: maintaining a global variable that always points to the interrupt
416: count for the current function, and which is to be charged whenever
417: there is a timer interrupt. Because we manipulate this variable in the
418: intermediate representation of the compiler, our method is very easy
419: to implement and has no nasty interactions with code generation or
420: garbage collection algorithms (which already preserve semantics of
421: intermediate-representation operations).
422:
423: This method has acceptable overhead and accuracy. Furthermore, by
424: judiciously mixing profiled and unprofiled functions, one can extract
425: information on inherited costs as well as the direct costs of calling
426: particular functions. This information is similar to that provided by
427: sophisticated profilers like gprof, but is more accurate.
428: .SH
429: References
430: .LP
431: .IP [Appel87] 1i
432: Appel, Andrew W. and MacQueen, David B. ``A Standard ML compiler,''
433: in \fIFunctional Programming Languages and Computer Architecture\fP,
434: LNCS 274, G. Kahn, ed., pp 301-324, 1987
435: .IP [Graham82]
436: Graham, Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick.
437: ``gprof: a call graph execution profiler''" in
438: \fIProc. SIGPLAN '82 Symp. on Compiler Construction, SIGPLAN Notices\fP
439: 17(4), pp. 120-126, 1982.
440: .IP [Johnston70]
441: Johnston, T. Y., and Johnson, R. H., \fIProgram Performance
442: Measurement\fP, SLAC User Note 33, Rev. 1, Stanford University, California, 1970.
443: .IP [Unix]
444: Unix Programmer's Manual, ``prof command,'' section 1, Bell Laboratories, Murray Hill,
445: NJ, 1979.
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.