|
|
1.1 ! root 1: .LP ! 2: - ! 3: .ft 3 ! 4: .ce 99 ! 5: .sp 2i ! 6: .LG ! 7: Profiling in the Presence of Optimization and Garbage Collection ! 8: .sp ! 9: .ft 2 ! 10: .ce 99 ! 11: .NL ! 12: Andrew W. Appel* ! 13: Bruce F. Duba\(dg ! 14: David B. MacQueen\(dg ! 15: .sp 0.6i ! 16: .ce 99 ! 17: .NL ! 18: November 1988 ! 19: .sp 1i ! 20: .ce ! 21: .ft 2 ! 22: ABSTRACT ! 23: .ft 1 ! 24: .IP ! 25: Profiling the execution of programs can be a great help in tuning their ! 26: performance, and programs written in functional languages are no exception. ! 27: The standard techniques of call-counting and statistical (interrupt-driven) ! 28: execution time measurement work well, but with some modification. In ! 29: particular, the program counter is not the best indicator of ``current ! 30: function.'' ! 31: Our profiler inserts explicit increment and assignment statements into ! 32: the intermediate representation, and is therefore very simple to implement ! 33: and completely independent of the code-generator. ! 34: .LP ! 35: .sp 1i ! 36: .nr PS 8 ! 37: .nr VS 10 ! 38: .LP ! 39: * Supported in part by NSF Grant CCR-8806121 and by a Digital Equipment Corp. ! 40: Faculty Incentive Grant. ! 41: .LP ! 42: \(dg AT&T Bell Laboratories, Murray Hill, NJ. ! 43: .nr PS 10 ! 44: .nr VS 16 ! 45: .LP ! 46: .bp ! 47: .NH ! 48: Execution profiling ! 49: .LP ! 50: A large program usually consists of many small functions. ! 51: When such a program is to be tuned for efficiency, it is necessary to ! 52: identify which of those functions are taking the bulk of the execution time. ! 53: Then the commonly-used functions can be made more efficient, or called less ! 54: often, or both. By using a theoretical analysis of the algorithms used ! 55: in a program, such functions can be identified; but a complete ! 56: theoretical analysis is complex and ! 57: impractical for large programs. ! 58: ! 59: An execution profiler provides an empirical measurement of the time spent ! 60: in each function. A widely-used Unix tool, \fBprof\fP [Unix], ! 61: provides a count of how many times each function is called, and how many ! 62: seconds were spent in each function. This information is very useful ! 63: in identifying which functions are in need of improvement, after which ! 64: a theoretical analysis of just those functions might be carried out, ! 65: a much less forbidding endeavor than analyzing the whole program. ! 66: ! 67: \fBProf\fP gathers ! 68: call-count information ! 69: by having the compiler insert at the beginning of each function ! 70: an instruction that increments a call-count variable associated with ! 71: the function. An approximation to the amount ! 72: of total time spent in the function is gathered by the use of a timer ! 73: interrupt: every 1/60th of a second, the operating system notes ! 74: in a ``histogram'' array the value of the program counter. ! 75: (This is an ancient technique [Johnston70].) ! 76: Then, at the end of program execution, \fBprof\fP estimates ! 77: the amount of time spent in each ! 78: function by summing the values in the histogram array ! 79: corresponding to program counter samples between the beginning and end of the ! 80: machine code for that function. ! 81: The interrupt-driven sampling method has a much lower overhead than ! 82: querying a clock on each entry to and exit from a function. ! 83: ! 84: A more elaborate profiling tool, \fBgprof\fP [Graham82], ! 85: provides even more information. ! 86: When one primitive function (e.g. a ! 87: table-lookup routine) is used in many places, it is useful to know, ! 88: not only the total time for the execution of the primitive, but also, how ! 89: much time to ``charge'' the calling functions. With \fBgprof\fP this is ! 90: approximated by keeping a count of the number of ! 91: times each call-site is used, and on that basis apportioning the ! 92: average execution time ! 93: of the called function to the functions that call it. ! 94: ! 95: These approaches to execution-time measurement and apportionment pose certain ! 96: problems for optimizing compilers and for functional languages: ! 97: .IP 1. ! 98: The machine-code for a function is not necessarily all contiguous. ! 99: A function may be turned into several pieces of code, with portions of ! 100: the code for other functions interspersed. This problem could certainly ! 101: be solved by elaborate bookkeeping in the optimizer and code generator, ! 102: but we wanted to avoid that complexity. ! 103: .IP 2. ! 104: An optimizer can expand functions in-line in other functions. ! 105: The program-counter method will charge the calling function ! 106: instead of the called function, ! 107: even though it might be desirable for in-line expansion to be made ! 108: semantically invisible. ! 109: .IP 3. ! 110: The histogram array in \fBprof\fP ! 111: must be proportional in size to the address range spanned by pieces of ! 112: code for executable functions. Our runtime ! 113: system intersperses code and data ! 114: throughout memory; even worse, it periodically garbage collects, moving ! 115: code and data from place to place. This problem could have been solved ! 116: by elaborate bookkeeping in the runtime system, which we also wanted to avoid. ! 117: .LP ! 118: We had to deal with these problems in the course of implementing a ! 119: profiler for an optimizing compiler for the functional language ! 120: Standard ML [Appel87]. The approach we used is described in the next ! 121: section. ! 122: .NH ! 123: Intermediate Representation of call-counting and current-function ! 124: .LP ! 125: For execution-time estimation we use a timer interrupt, as does ! 126: \fBprof\fP, to increment a histogram entry. However, we don't use the ! 127: program counter to calculate which histogram entry to increment. ! 128: Instead, we maintain a ``pointer-to-current-function-entry'' in a ! 129: global variable called \fBcurrent\fP that is accessible to the ! 130: timer-interrupt handler. Each function has associated with it two ! 131: auxiliary variables: a call-count and an interrupt-count. On entry to ! 132: a function, it increments the call-count and assigns the address of ! 133: the interrupt-count variable into the global \fBcurrent\fP variable. ! 134: Then, when a timer interrupt occurs, the interrupt handler just increments the ! 135: variable that \fBcurrent\fP points to. ! 136: ! 137: When a function returns \(em either normally or via an exception \(em ! 138: \fBcurrent\fP must be set back to the interrupt-count variable of the ! 139: function that it is returning to. This resetting could be done either ! 140: by the calling function (after the called function has returned), or by ! 141: the called function before exit. For several reasons, it is better done by ! 142: the calling function. If the called function does the reset, a stack ! 143: of current-function pointers is required; this is expensive to maintain. ! 144: A stack of current-pointers would also greatly complicate the treatment ! 145: of exception-handlers; with the caller-reset method, the exception-handler ! 146: justs sets the \fBcurrent\fP to point at the appropriate counter variable ! 147: on entry. ! 148: On recursive calls, \fBcurrent\fP need not be reset (as the calling ! 149: and called function are the same), but only the calling function knows ! 150: which calls are recursive. And finally, tail-calls can be optimized ! 151: if the caller resets \fBcurrent\fP. ! 152: ! 153: A tail-call is one that is not followed by any ! 154: executable code before the function returns. After a tail-call, the ! 155: function will immediately return and therefore \fBcurrent\fP will ! 156: immediately be reset. Therefore, it is not necessary for the calling ! 157: function to reset \fBcurrent\fP after a tail-call. ! 158: This is a useful optimization, and it is particularly important ! 159: when used with a compiler that optimizes tail-calls into jumps; if the ! 160: current pointer had to be reset after the tail-call, it would no ! 161: longer be a tail-call and performance would suffer dramatically. ! 162: Fortunately, it is easy to identify tail-calls statically as the ! 163: profiling instructions are being inserted. ! 164: ! 165: We insert the profiling instructions as ordinary assignment statements ! 166: in the intermediate representation. In almost any compiler's ! 167: intermediate representation it is easy to represent the operations ! 168: of fetching, adding one, and storing, for the call-count increment operation; ! 169: and storing, for the assignment to the \fBcurrent\fP variable. ! 170: ! 171: Functional programming languages introduce another problem for the ! 172: design of profilers: what to do with anonymous, first-class functions. ! 173: The simplest choice is to do nothing; collect no call-counts and let ! 174: the time be charged to the caller of the unnamed function. The main ! 175: disadvantage of this solution, besides not having the call-counts, is ! 176: that there is no convenient way to find the code that contributes to ! 177: the cost of a profiled function that calls anonymous functions. ! 178: ! 179: Probably the most general solution is to make up names for the unnamed ! 180: functions (for example, an unnamed function statically enclosed in function ! 181: \fIf\fP might be called \fIf.anon\fP). ! 182: If anonymous functions are given names they can be treated ! 183: just as any other function; call counts and execution time will be ! 184: reported. Of course, the user will need to associate the new names ! 185: with the correct function, but in practice this is rarely a problem. ! 186: .NH ! 187: An example ! 188: .LP ! 189: To illustrate the technique, we present a simple example (figure 1). The ML function ! 190: \f(CWsubset\fP takes a predicate function as an argument, and returns ! 191: a function that maps lists to lists; the output list will be that sublist ! 192: of the input list containing just those elements that satisfy the predicate. ! 193: The user's program is displayed in typewriter font; the compiler puts some ! 194: scaffolding around it (indicated in italics) to make a record ! 195: containing all the functions declared by the user. ! 196: .KF ! 197: .DS ! 198: .ft CW ! 199: \fIlet\fP fun subset pred = ! 200: let fun f nil = nil ! 201: | f (a::r) = if pred a then a::f(r) else f(r) ! 202: in f ! 203: end ! 204: ! 205: fun isPrime x = ! 206: let fun test i = i>=x orelse (x mod i <> 0 andalso test(i+1)) ! 207: in test 2 ! 208: end ! 209: ! 210: val primes = subset isPrime ! 211: \fI in (subset, isPrime, primes) ! 212: end\fP ! 213: .ft R ! 214: .DE ! 215: .DS C ! 216: Figure 1. ! 217: .DE ! 218: .KE ! 219: If this code is compiled with profiling enabled, the compiler inserts ! 220: the call-counting and current-function instructions into the intermediate ! 221: representation. Here, we display the effects as if written in the source language ! 222: (figure 2). ! 223: .KF ! 224: .DS ! 225: .ft CW ! 226: \fIlet val subset.CC = ref 0 and subset.IC = ref 0 ! 227: and subset.f.CC = ref 0 and subset.f.IC = ref 0 ! 228: and isPrime.CC = ref 0 and isPrime.IC = ref 0 ! 229: and isPrime.test.CC = ref 0 and isPrime.test.IC = ref 0 ! 230: \fP ! 231: fun subset pred = ! 232: \fI(subset.CC := !subset.CC + 1;\fP ! 233: \fIcurrent := subset.IC;\fP ! 234: let fun f x = ! 235: \fI(subset.f.CC := !subset.f.CC + 1;\fP ! 236: \fIcurrent := subset.f.IC;\fP ! 237: case x of ! 238: nil => nil ! 239: | a::r => let val pa = pred a ! 240: in \fIcurrent := subset.f.IC;\fP ! 241: if pa then a :: f(r) else f(r) ! 242: end ! 243: in f ! 244: end ! 245: ! 246: fun isPrime x = \fI(isPrime.CC := !isPrime.CC + 1;\fP ! 247: \fIcurrent := isPrime.IC;\fP ! 248: . . . ) ! 249: ! 250: val primes = subset isPrime ! 251: \fI ! 252: in ((subset, isPrime, primes), ! 253: ((subset.CC, subset.IC, "subset"), ! 254: (subset.f.CC, subset.f.IC, "subset.f"), ! 255: (isPrime.test.CC, isPrime.test.IC, "isPrime.test"), ! 256: (isPrime.CC, isPrime.IC, "isPrime"))) ! 257: end\fP ! 258: .ft R ! 259: .DE ! 260: .DS C ! 261: Figure 2. ! 262: .DE ! 263: .KE ! 264: ! 265: For each function, two variables are introduced: a call-count and an interrupt-count. ! 266: On entry to a function, the call-count is incremented, and the global variable ! 267: \fBcurrent\fP is set to point to the interrupt-count. On re-entry to a function ! 268: after a subroutine call, \fBcurrent\fP is reset to the function's ! 269: interrupt-count variable. However, this is not necessary after recursive calls ! 270: and tail calls, e.g. the calls to \f(CWf\fP. ! 271: ! 272: The initial \fIlet\fP-bindings create all the count variables, and the ! 273: last four lines produce, instead of just a record containing the user's ! 274: declared objects, a pair of records: the user's declared objects, and a ! 275: list of records containing profiling variables, each with an identifying ! 276: string constant. These string constants will be embedded in the executable ! 277: code for this module, and will enable the call-count variables to be self-identifying. ! 278: Our runtime system maintains a global list of these ! 279: 3-element records; when it is time to print an execution profile, they are ! 280: sorted in decreasing order of interrupt-count. ! 281: ! 282: .KS ! 283: Our output looks like the output of \fBprof\fP: ! 284: .DS ! 285: .ft CW ! 286: %time cumsecs #call ms/call name ! 287: 90.4 3.52 78189 .045 isPrime.test ! 288: 8.4 3.85 1000 .330 isPrime ! 289: .7 3.88 0 (unprofiled) ! 290: .2 3.89 1001 .009 subset.f ! 291: .0 3.89 1001 .000 natlist ! 292: .0 3.89 1 .000 subset ! 293: .ft R ! 294: .DE ! 295: .KE ! 296: ! 297: Now, armed with this information, a programmer ! 298: might decide that it is worthwhile ! 299: re-writing the \f(CWisPrime\fP function ! 300: to make it as efficient as possible. But at a certain point ! 301: the programmer will want to know what functions are calling \f(CWisPrime\fP ! 302: so he can make them call it less often. By re-compiling with ! 303: \f(CWisPrime\fP unprofiled, any time spent in \f(CWisPrime\fP will ! 304: now be charged to the function that called it. This is because ! 305: \f(CWisPrime\fP will not change the \fBcurrent\fP variable, so that the ! 306: timer-interrupt will increment the count for the function that last ! 307: set \fBcurrent\fP \(em and this will be the one that called \fPisPrime\fP. ! 308: The profiling system won't do this automatically, but by comparing two ! 309: different execution profiles, one with \f(CWisPrime\fP compiled with ! 310: profile instructions and one with \f(CWisPrime\fP unprofiled, an accurate estimate ! 311: can be made of who is calling it. ! 312: .NH ! 313: Advantages of our current-function method ! 314: .LP ! 315: Since we use ordinary intermediate-representation operators ! 316: for profiling, ! 317: the optimizer and code-generator ``believe'' that ! 318: profiling operations are part of the program. ! 319: Since an optimizer must not modify the semantics of the program, ! 320: the semantics of profiling will not be modified either. ! 321: Therefore, if one function is copied and inserted in-line into another, ! 322: the call-count and current-function instructions will be copied and ! 323: inserted at the right place. Other optimizations that break functions ! 324: into several disjoint pieces of code will leave the profiling ! 325: instructions in the appropriate places. ! 326: ! 327: Furthermore, the result is that the implementation of the profiler ! 328: is completely independent of the code generator. We have four different ! 329: code generators for our compiler (two different algorithms each for the ! 330: Vax and the Motorola 68020), and not a line of any of them was modified ! 331: for the installation of the profiler. ! 332: ! 333: By compiling some functions unprofiled, as described in the previous section, ! 334: we can find out what callers are responsible for most of their execution time. ! 335: This kind of trick serves much the same purpose that the more elaborate ! 336: program \fBgprof\fP does; and it's a trick that wouldn't work with ! 337: a program-counter histogram. ! 338: Furthermore, our method is more accurate than \fBgprof\fP. Suppose ! 339: functions \fIf\fP and \fIg\fP both call a function \fIisPrime\fP, but ! 340: \fIf\fP consistently makes expensive calls (that take a long time) while ! 341: \fIg\fP makes cheap ones. \fBGprof\fP allocates the total time spent in ! 342: \fIisPrime\fP on the basis of call counts from \fIf\fP and \fIg\fP; ! 343: this will miss the fact that \fIf\fP is responsible for most of the cost. ! 344: In this example, when profiling for \fIisPrime\fP is turned off, \fIf\fP ! 345: and \fIg\fP will be charged for the actual time spent in \fIisPrime\fP ! 346: on their behalf. (On the other hand, \fBGprof\fP will give an accurate ! 347: breakdown of call-site counts that our method does not provide.) ! 348: ! 349: If a profiled function calls an unprofiled function, then during the ! 350: execution of the called function, all timer interrupts will be charged ! 351: to the caller (since \fBcurrent\fP still points to the caller's ! 352: variable). This is often desirable, as described above. But if an ! 353: unprofiled function calls a profiled function, then upon return to the ! 354: unprofiled function the \fBcurrent\fP pointer won't be reset, and ! 355: interrupts will continue to be charged to the called function after it ! 356: has returned. This is undesirable, and should be prevented by the ! 357: compiler. In a language with first-class functions, it is difficult ! 358: to prevent profiled functions from being passed as arguments to ! 359: unprofiled functions that might then call them. In practice, this has ! 360: not proved to be a problem, probably because unprofiled functionals ! 361: are typically simple primitives like \fIapp\fP and \fImap\fP, which do ! 362: little intrinsic computation. ! 363: .NH ! 364: Overhead measurements ! 365: .LP ! 366: We ran the same program several times with various of our profiling ! 367: features enabled; this gives a reasonably accurate measurement of profiling ! 368: overhead: ! 369: .KF ! 370: .TS ! 371: tab(|) box center; ! 372: l c c ! 373: l n n. ! 374: |Time|%Overhead|LenL|LenR|Comp ! 375: _ ! 376: User code|2801 sec| ! 377: _ ! 378: Call counts|568|20.3% ! 379: Setting current function|286|10.2 ! 380: Interrupts|47|1.7 ! 381: _ ! 382: Total Overhead|901|32.2% ! 383: .TE ! 384: .KE ! 385: The total overhead of 32% is not prohibitively expensive. Our code generator ! 386: takes three instructions to increment a call-count (fetch, add, store); ! 387: a better instruction-selector could probably reduce this overhead to 8%, and the ! 388: total overhead to 20%. ! 389: ! 390: .KF ! 391: There is also an implementation overhead; it turned out to be fairly simple ! 392: to get this profiler running. ! 393: .TS ! 394: tab(|) center; ! 395: l n. ! 396: Insertion of profiling instructions|49 lines ! 397: Interrupt handling|32 ! 398: Global database|16 ! 399: Report generation|72 ! 400: _ ! 401: Total|169 lines ! 402: .TE ! 403: In contrast, this paper is about 500 lines long. ! 404: .KE ! 405: .NH ! 406: Conclusion ! 407: .LP ! 408: Traditional approaches to profiling run into problems when we attempt to apply ! 409: them to functional languages where code may be moved around by garbage collection, ! 410: and the task is further complicated when an optimizing compiler freely rearranges ! 411: the code. The basic difficulty is that the mapping between the current pc and the ! 412: currently executing function is difficult to maintain. ! 413: ! 414: We have found a simple way around this difficulty, which consists of ! 415: maintaining a global variable that always points to the interrupt ! 416: count for the current function, and which is to be charged whenever ! 417: there is a timer interrupt. Because we manipulate this variable in the ! 418: intermediate representation of the compiler, our method is very easy ! 419: to implement and has no nasty interactions with code generation or ! 420: garbage collection algorithms (which already preserve semantics of ! 421: intermediate-representation operations). ! 422: ! 423: This method has acceptable overhead and accuracy. Furthermore, by ! 424: judiciously mixing profiled and unprofiled functions, one can extract ! 425: information on inherited costs as well as the direct costs of calling ! 426: particular functions. This information is similar to that provided by ! 427: sophisticated profilers like gprof, but is more accurate. ! 428: .SH ! 429: References ! 430: .LP ! 431: .IP [Appel87] 1i ! 432: Appel, Andrew W. and MacQueen, David B. ``A Standard ML compiler,'' ! 433: in \fIFunctional Programming Languages and Computer Architecture\fP, ! 434: LNCS 274, G. Kahn, ed., pp 301-324, 1987 ! 435: .IP [Graham82] ! 436: Graham, Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick. ! 437: ``gprof: a call graph execution profiler''" in ! 438: \fIProc. SIGPLAN '82 Symp. on Compiler Construction, SIGPLAN Notices\fP ! 439: 17(4), pp. 120-126, 1982. ! 440: .IP [Johnston70] ! 441: Johnston, T. Y., and Johnson, R. H., \fIProgram Performance ! 442: Measurement\fP, SLAC User Note 33, Rev. 1, Stanford University, California, 1970. ! 443: .IP [Unix] ! 444: Unix Programmer's Manual, ``prof command,'' section 1, Bell Laboratories, Murray Hill, ! 445: NJ, 1979.
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.