Annotation of researchv10dc/cmd/sml/doc/papers/profiling/paper.ms, revision 1.1.1.1

                      1: .LP
                      2: -
                      3: .ft 3
                      4: .ce 99
                      5: .sp 2i
                      6: .LG
                      7: Profiling in the Presence of Optimization and Garbage Collection
                      8: .sp
                      9: .ft 2
                     10: .ce 99
                     11: .NL
                     12: Andrew W. Appel*
                     13: Bruce F. Duba\(dg
                     14: David B. MacQueen\(dg
                     15: .sp 0.6i
                     16: .ce 99
                     17: .NL
                     18: November 1988
                     19: .sp 1i
                     20: .ce
                     21: .ft  2
                     22: ABSTRACT
                     23: .ft 1
                     24: .IP
                     25: Profiling the execution of programs can be a great help in tuning their
                     26: performance, and programs written in functional languages are no exception.
                     27: The standard techniques of call-counting and statistical (interrupt-driven)
                     28: execution time measurement work well, but with some modification.  In
                     29: particular, the program counter is not the best indicator of ``current
                     30: function.''
                     31: Our profiler inserts explicit increment and assignment statements into
                     32: the intermediate representation, and is therefore very simple to implement
                     33: and completely independent of the code-generator.
                     34: .LP
                     35: .sp 1i
                     36: .nr PS  8
                     37: .nr VS 10
                     38: .LP
                     39: * Supported in part by NSF Grant CCR-8806121 and by a Digital Equipment Corp.
                     40: Faculty Incentive Grant.
                     41: .LP
                     42: \(dg AT&T Bell Laboratories, Murray Hill, NJ.
                     43: .nr PS 10
                     44: .nr VS 16
                     45: .LP
                     46: .bp
                     47: .NH
                     48: Execution profiling
                     49: .LP
                     50: A large program usually consists of many small functions.  
                     51: When such a program is to be tuned for efficiency, it is necessary to
                     52: identify which of those functions are taking the bulk of the execution time.
                     53: Then the commonly-used functions can be made more efficient, or called less
                     54: often, or both.  By using a theoretical analysis of the algorithms used
                     55: in a program, such functions can be identified; but a complete
                     56: theoretical analysis is complex and
                     57: impractical for large programs.
                     58: 
                     59: An execution profiler provides an empirical measurement of the time spent
                     60: in each function.  A widely-used Unix tool, \fBprof\fP [Unix],
                     61: provides a count of how many times each function is called, and how many
                     62: seconds were spent in each function.   This information is very useful
                     63: in identifying which functions are in need of improvement, after which
                     64: a theoretical analysis of just those functions might be carried out,
                     65: a much less forbidding endeavor than analyzing the whole program.
                     66: 
                     67: \fBProf\fP gathers
                     68: call-count information 
                     69: by having the compiler insert at the beginning of each function
                     70: an instruction that increments a call-count variable associated with
                     71: the function.  An approximation to the amount
                     72: of total time spent in the function is gathered by the use of a timer
                     73: interrupt:  every 1/60th of a second, the operating system notes
                     74: in a ``histogram'' array the value of the program counter.
                     75: (This is an ancient technique [Johnston70].)
                     76: Then, at the end of program execution, \fBprof\fP estimates
                     77: the amount of time spent in each
                     78: function by summing the values in the histogram array
                     79: corresponding to program counter samples between the beginning and end of the
                     80: machine code for that function.
                     81: The interrupt-driven sampling method has a much lower overhead than
                     82: querying a clock on each entry to and exit from a function.
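
As a rough illustration of the summation step described above, the following Standard ML sketch charges histogram samples to functions by address range. The bucket size, the function table, and the sample counts are all hypothetical values chosen for the example; they are not taken from \fBprof\fP itself.

```sml
(* Hypothetical layout: each histogram bucket covers bucketSize bytes of code. *)
val bucketSize = 4

(* (name, start address, end address), end exclusive; illustrative values only. *)
val functions = [("main", 0, 16), ("lookup", 16, 40), ("insert", 40, 48)]

(* Program-counter samples, one entry per bucket (made-up numbers). *)
val histogram = Vector.fromList [3, 0, 1, 0, 12, 9, 7, 0, 2, 1, 0, 5]

(* Sum the samples in the buckets whose addresses lie in [lo, hi). *)
fun samplesIn (lo, hi) =
    let fun loop (b, acc) =
            if b * bucketSize >= hi then acc
            else loop (b + 1, acc + Vector.sub (histogram, b))
    in loop (lo div bucketSize, 0)
    end

(* With one sample every 1/60th of a second, n samples estimate n/60 seconds. *)
fun seconds n = real n / 60.0

val report = map (fn (name, lo, hi) => (name, samplesIn (lo, hi))) functions
```

The sketch assumes that each function occupies one contiguous address range, which is exactly the assumption that the problems listed later in this section call into question.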
                     83: 
                     84: A more elaborate profiling tool, \fBgprof\fP [Graham82],
                     85: provides even more information.  
                     86: When one primitive function (e.g. a
                     87: table-lookup routine) is used in many places, it is useful to know
                     88: not only the total time for the execution of the primitive, but also how
                     89: much time to ``charge'' to the calling functions. With \fBgprof\fP this is
                     90: approximated by keeping a count of the number of
                     91: times each call-site is used, and on that basis apportioning the
                     92: average execution time
                     93: of the called function to the functions that call it.
                     94: 
                     95: These approaches to execution-time measurement and apportionment pose certain
                     96: problems for optimizing compilers and for functional languages:
                     97: .IP 1.
                     98: The machine-code for a function is not necessarily all contiguous.
                     99: A function may be turned into several pieces of code, with portions of
                    100: the code for other functions interspersed.  This problem could certainly
                    101: be solved by elaborate bookkeeping in the optimizer and code generator,
                    102: but we wanted to avoid that complexity.
                    103: .IP 2.
                    104: An optimizer can expand functions in-line in other functions.
                    105: The program-counter method will charge the calling function
                    106: instead of the called function,
                    107: even though it might be desirable for in-line expansion to be made
                    108: semantically invisible.
                    109: .IP 3.
                    110: The histogram array in \fBprof\fP
                    111: must be proportional in size to the address range spanned by pieces of
                    112: code for executable functions.  Our runtime
                    113: system intersperses code and data
                    114: throughout memory; even worse, it periodically garbage collects, moving
                    115: code and data from place to place.  This problem could have been solved
                    116: by elaborate bookkeeping in the runtime system, which we also wanted to avoid.
                    117: .LP
                    118: We had to deal with these problems in the course of implementing a
                    119: profiler for an optimizing compiler for the functional language
                    120: Standard ML [Appel87].  The approach we used is described in the next
                    121: section.
                    122: .NH
                     123: Intermediate representation of call-counting and current-function
                    124: .LP
                    125: For execution-time estimation we use a timer interrupt, as does
                    126: \fBprof\fP, to increment a histogram entry.  However, we don't use the
                    127: program counter to calculate which histogram entry to increment.
                    128: Instead, we maintain a ``pointer-to-current-function-entry'' in a
                    129: global variable called \fBcurrent\fP that is accessible to the
                    130: timer-interrupt handler.  Each function has associated with it two
                     131: auxiliary variables: a call-count and an interrupt-count.  On entry, a
                     132: function increments its call-count and assigns the address of its
                     133: interrupt-count variable to the global \fBcurrent\fP variable.
                    134: Then, when a timer interrupt occurs, the interrupt handler just increments the
                    135: variable that \fBcurrent\fP points to.
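
The mechanism just described can be sketched in Standard ML as follows. This is only an illustration under stated assumptions: the names \f(CWcounters\fP, \f(CWnewCounters\fP, \f(CWenter\fP, \f(CWtimerInterrupt\fP and the catch-all \f(CWunprofiled\fP counter are hypothetical, and the timer interrupt is simulated by an ordinary function call rather than delivered asynchronously by the operating system.

```sml
(* Each profiled function owns a call-count and an interrupt-count. *)
type counters = {name : string, calls : int ref, ticks : int ref}

fun newCounters name : counters = {name = name, calls = ref 0, ticks = ref 0}

(* Hypothetical catch-all counter for time spent outside profiled code. *)
val unprofiled = newCounters "(unprofiled)"

(* The global "current" pointer that the interrupt handler reads. *)
val current : int ref ref = ref (#ticks unprofiled)

(* Operations performed at the start of a profiled function's body. *)
fun enter (c : counters) =
    (#calls c := !(#calls c) + 1;
     current := #ticks c)

(* Simulated interrupt handler: charge one tick to whatever current points at. *)
fun timerInterrupt () = !current := !(!current) + 1

(* Usage: enter a function f, then let two simulated interrupts arrive. *)
val f = newCounters "f"
val () = enter f
val () = timerInterrupt ()
val () = timerInterrupt ()
```

After these steps \f(CWf\fP has one call and two ticks recorded, and the catch-all counter is untouched; the handler never needs to inspect the program counter.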
                    136: 
                    137: When a function returns \(em either normally or via an exception \(em
                    138: \fBcurrent\fP must be set back to the interrupt-count variable of the
                    139: function that it is returning to.  This resetting could be done either
                    140: by the calling function (after the called function has returned), or by
                    141: the called function before exit.  For several reasons, it is better done by
                    142: the calling function.  If the called function does the reset,  a stack
                    143: of current-function pointers is required; this is expensive to maintain.
                    144: A stack of current-pointers would also greatly complicate the treatment
                    145: of exception-handlers; with the caller-reset method, the exception-handler
                     146: just sets \fBcurrent\fP to point at the appropriate counter variable
                    147: on entry.
                    148: On recursive calls, \fBcurrent\fP need not be reset (as the calling
                    149: and called function are the same), but only the calling function knows
                    150: which calls are recursive.  And finally, tail-calls can be optimized
                    151: if the caller resets \fBcurrent\fP.
                    152: 
                    153: A tail-call is one that is not followed by any
                    154: executable code before the function returns.  After a tail-call, the
                    155: function will immediately return and therefore \fBcurrent\fP will
                    156: immediately be reset.  Therefore, it is not necessary for the calling
                    157: function to reset \fBcurrent\fP after a tail-call.
                    158: This is a useful optimization, and it is particularly important
                    159: when used with a compiler that optimizes tail-calls into jumps; if the
                    160: current pointer had to be reset after the tail-call, it would no
                    161: longer be a tail-call and performance would suffer dramatically.
                    162: Fortunately, it is easy to identify tail-calls statically as the
                    163: profiling instructions are being inserted.
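
A small Standard ML sketch may make the caller-reset rule concrete. The functions \f(CWg\fP and \f(CWh\fP, the counters \f(CWgIC\fP and \f(CWhIC\fP, and the explicit \f(CWtick\fP call standing in for a timer interrupt are all hypothetical; call-counts are omitted to keep the example short.

```sml
val current : int ref ref = ref (ref 0)   (* hypothetical global pointer *)
val gIC = ref 0 and hIC = ref 0           (* interrupt-counts for g and h *)

(* Stand-in for a timer interrupt, called explicitly so its effect is visible. *)
fun tick () = !current := !(!current) + 1

fun h x = (current := hIC; tick (); x + 1)

fun g x =
    (current := gIC;
     let val y = h x            (* non-tail call *)
     in current := gIC;         (* caller resets current on re-entry *)
        tick ();                (* this tick is charged to g *)
        h y                     (* tail call: no reset follows it *)
     end)

val result = g 0
```

Here one tick is charged to \f(CWg\fP and two to \f(CWh\fP. No code follows the final call to \f(CWh\fP, so it remains a tail-call; whichever function called \f(CWg\fP is responsible for resetting \fBcurrent\fP afterwards.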
                    164: 
                    165: We insert the profiling instructions as ordinary assignment statements
                     166: in the intermediate representation.  In almost any compiler's
                     167: intermediate representation it is easy to represent the fetch, add-one,
                     168: and store operations that increment a call-count, and the single store
                     169: that assigns to the \fBcurrent\fP variable.
                    170: 
                    171: Functional programming languages introduce another problem for the
                    172: design of profilers: what to do with anonymous, first-class functions.
                    173: The simplest choice is to do nothing; collect no call-counts and let
                    174: the time be charged to the caller of the unnamed function. The main
                    175: disadvantage of this solution, besides not having the call-counts, is
                    176: that there is no convenient way to find the code that contributes to
                    177: the cost of a profiled function that calls anonymous functions.
                    178: 
                    179: Probably the most general solution is to make up names for the unnamed
                    180: functions (for example, an unnamed function statically enclosed in function
                    181: \fIf\fP might be called \fIf.anon\fP).
                    182: If anonymous functions are given names they can be treated
                    183: just as any other function; call counts and execution time will be
                    184: reported.  Of course, the user will need to associate the new names
                    185: with the correct function, but in practice this is rarely a problem.
                    186: .NH
                    187: An example
                    188: .LP
                    189: To illustrate the technique, we present a simple example (figure 1).  The ML function
                    190: \f(CWsubset\fP takes a predicate function as an argument, and returns
                    191: a function that maps lists to lists; the output list will be that sublist
                    192: of the input list containing just those elements that satisfy the predicate.  
                    193: The user's program is displayed in typewriter font; the compiler puts some
                    194: scaffolding around it (indicated in italics) to make a record
                    195: containing all the functions declared by the user.
                    196: .KF
                    197: .DS
                    198: .ft CW
                    199: \fIlet\fP fun subset pred =
                    200:        let fun f nil = nil
                    201:              | f (a::r) = if pred a then a::f(r) else f(r)
                    202:         in f
                    203:        end
                    204: 
                    205:    fun isPrime x = 
                    206:        let fun test i = i>=x orelse (x mod i <> 0 andalso test(i+1))
                    207:         in test 2
                    208:        end
                    209: 
                    210:    val primes = subset isPrime
                    211: \fI in (subset, isPrime, primes)
                    212: end\fP
                    213: .ft R
                    214: .DE
                    215: .DS C
                    216: Figure 1.
                    217: .DE
                    218: .KE
                    219: If this code is compiled with profiling enabled, the compiler inserts
                    220: the call-counting and current-function instructions into the intermediate
                    221: representation.  Here, we display the effects as if written in the source language
                    222: (figure 2).
                    223: .KF
                    224: .DS
                    225: .ft CW
                    226: \fIlet val subset.CC = ref 0 and subset.IC = ref 0
                    227:     and subset.f.CC = ref 0 and subset.f.IC = ref 0
                    228:     and isPrime.CC = ref 0 and isPrime.IC = ref 0
                    229:     and isPrime.test.CC = ref 0 and isPrime.test.IC = ref 0
                    230: \fP
                    231:    fun subset pred =
                    232:        \fI(subset.CC := !subset.CC + 1;\fP
                    233:        \fIcurrent := subset.IC;\fP
                    234:        let fun f x =
                    235:            \fI(subset.f.CC := !subset.f.CC + 1;\fP
                    236:            \fIcurrent := subset.f.IC;\fP
                    237:            case x of
                    238:              nil => nil
                    239:            | a::r => let val pa = pred a
                    240:                       in \fIcurrent := subset.f.IC;\fP
                    241:                          if pa then a :: f(r) else f(r)
                     242:                      end\fI)\fP
                    243:         in f
                     244:        end\fI)\fP
                    245: 
                    246:    fun isPrime x = \fI(isPrime.CC := !isPrime.CC + 1;\fP
                    247:                    \fIcurrent := isPrime.IC;\fP
                    248:                    . . . )
                    249: 
                    250:    val primes = subset isPrime
                    251: \fI
                    252:  in ((subset, isPrime, primes),
                    253:      ((subset.CC, subset.IC, "subset"),
                    254:       (subset.f.CC, subset.f.IC, "subset.f"),
                    255:       (isPrime.test.CC, isPrime.test.IC, "isPrime.test"),
                    256:       (isPrime.CC, isPrime.IC, "isPrime")))
                    257: end\fP
                    258: .ft R
                    259: .DE
                    260: .DS C
                    261: Figure 2.
                    262: .DE
                    263: .KE
                    264: 
                    265: For each function, two variables are introduced:  a call-count and an interrupt-count.
                    266: On entry to a function, the call-count is incremented, and the global variable
                    267: \fBcurrent\fP is set to point to the interrupt-count.  On re-entry to a function
                    268: after a subroutine call, \fBcurrent\fP is reset to the function's
                    269: interrupt-count variable.  However, this is not necessary after recursive calls
                    270: and tail calls, e.g. the calls to \f(CWf\fP.
                    271: 
                    272: The initial \fIlet\fP-bindings create all the count variables, and the
                    273: last four lines produce, instead of just a record containing the user's
                    274: declared objects, a pair of records:  the user's declared objects, and a
                    275: list of records containing profiling variables, each with an identifying
                    276: string constant.  These string constants will be embedded in the executable
                    277: code for this module, and will enable the call-count variables to be self-identifying.
                    278: Our runtime system maintains a global list of these
                    279: 3-element records; when it is time to print an execution profile, they are
                    280: sorted in decreasing order of interrupt-count.
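
The report step can be sketched as follows, assuming the global list holds (call-count, interrupt-count, name) records as in figure 2. The counter values below are invented for the example, and the insertion sort is just one simple way to obtain the ordering; the paper does not say which sorting method the runtime system uses.

```sml
(* Hypothetical snapshot of the global list: (call-count, interrupt-count, name). *)
val records = [(ref 1, ref 0, "subset"),
               (ref 1001, ref 1, "subset.f"),
               (ref 78189, ref 211, "isPrime.test"),
               (ref 1000, ref 20, "isPrime")]

(* Insert one record into a list already in decreasing interrupt-count order. *)
fun insert (r, []) = [r]
  | insert (r as (_, ic, _), (s as (_, ic', _)) :: rest) =
      if !ic >= !ic' then r :: s :: rest else s :: insert (r, rest)

(* Insertion sort over the whole list. *)
val sorted = foldl insert [] records

(* The names in the order in which they would be printed. *)
val names = map (fn (_, _, name) => name) sorted
```

Because each record carries its own string constant, the report generator needs no symbol table to label the rows.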
                    281: 
                    282: .KS
                    283: Our output looks like the output of \fBprof\fP:
                    284: .DS
                    285: .ft CW
                    286: %time  cumsecs   #call ms/call  name
                    287:  90.4     3.52   78189    .045  isPrime.test
                    288:   8.4     3.85    1000    .330  isPrime
                    289:    .7     3.88       0          (unprofiled)
                    290:    .2     3.89    1001    .009  subset.f
                    291:    .0     3.89    1001    .000  natlist
                    292:    .0     3.89       1    .000  subset
                    293: .ft R
                    294: .DE
                    295: .KE
                    296: 
                    297: Now, armed with this information, a programmer
                    298: might decide that it is worthwhile
                    299: re-writing the \f(CWisPrime\fP function
                    300: to make it as efficient as possible.  But at a certain point
                    301: the programmer will want to know what functions are calling \f(CWisPrime\fP
                     302: in order to make them call it less often.  If the program is re-compiled with
                    303: \f(CWisPrime\fP unprofiled, any time spent in \f(CWisPrime\fP will
                    304: now be charged to the function that called it.  This is because
                    305: \f(CWisPrime\fP will not change the \fBcurrent\fP variable, so that the
                    306: timer-interrupt will increment the  count for the function that last
                     307: set \fBcurrent\fP \(em and this will be the one that called \f(CWisPrime\fP.
                    308: The profiling system won't do this automatically, but by comparing two
                    309: different execution profiles, one with \f(CWisPrime\fP compiled with
                    310: profile instructions and one with \f(CWisPrime\fP unprofiled, an accurate estimate
                    311: can be made of who is calling it.
                    312: .NH 
                    313: Advantages of our current-function method
                    314: .LP
                    315: Since we use ordinary intermediate-representation operators
                    316: for profiling,
                    317: the optimizer and code-generator ``believe'' that 
                    318: profiling operations are part of the program.
                    319: Since an optimizer must not modify the semantics of the program,
                    320: the semantics of profiling will not be modified either.
                    321: Therefore, if one function is copied and inserted in-line into another,
                    322: the call-count and current-function instructions will be copied and
                    323: inserted at the right place.  Other optimizations that break functions
                    324: into several disjoint pieces of code will leave the profiling
                    325: instructions in the appropriate places.
                    326: 
                    327: Furthermore, the result is that the implementation of the profiler
                    328: is completely independent of the code generator.  We have four different
                    329: code generators for our compiler (two different algorithms each for the
                    330: Vax and the Motorola 68020), and not a line of any of them was modified
                    331: for the installation of the profiler.
                    332: 
                    333: By compiling some functions unprofiled, as described in the previous section,
                    334: we can find out what callers are responsible for most of their execution time.
                    335: This kind of trick serves much the same purpose that the more elaborate
                    336: program \fBgprof\fP does; and it's a trick that wouldn't work with
                    337: a program-counter histogram.
                    338: Furthermore, our method is more accurate than \fBgprof\fP.  Suppose
                    339: functions \fIf\fP and \fIg\fP both call a function \fIisPrime\fP, but
                    340: \fIf\fP consistently makes expensive calls (that take a long time) while
                    341: \fIg\fP makes cheap ones.  \fBGprof\fP allocates the total time spent in
                    342: \fIisPrime\fP on the basis of call counts from \fIf\fP and \fIg\fP;
                    343: this will miss the fact that \fIf\fP is responsible for most of the cost.
                    344: In this example, when profiling for \fIisPrime\fP is turned off, \fIf\fP
                    345: and \fIg\fP will be charged for the actual time spent in \fIisPrime\fP
                     346: on their behalf.  (On the other hand, \fBgprof\fP will give an accurate
                    347: breakdown of call-site counts that our method does not provide.)
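
The difference between the two ways of apportioning time can be shown with a small worked example in Standard ML. The call counts and times are hypothetical: \fIf\fP makes 100 expensive calls costing 9 seconds in total, and \fIg\fP makes 900 cheap calls costing 1 second in total.

```sml
(* Hypothetical data: (caller, calls to isPrime, seconds actually spent
   in isPrime on that caller's behalf). *)
val callers = [("f", 100, 9.0), ("g", 900, 1.0)]

val totalCalls = foldl (fn ((_, n, _), acc) => acc + n) 0 callers
val totalTime  = foldl (fn ((_, _, t), acc) => acc + t) 0.0 callers

(* Call-count apportionment: split the total time in proportion to calls. *)
val byCallCount =
    map (fn (name, n, _) => (name, totalTime * real n / real totalCalls)) callers

(* What each caller's own counter receives when isPrime is unprofiled. *)
val byActualTime = map (fn (name, _, t) => (name, t)) callers
```

Under these assumed numbers, apportioning by call count charges 1 second to \fIf\fP and 9 seconds to \fIg\fP, the reverse of the time each caller actually caused.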
                    348: 
                    349: If a profiled function calls an unprofiled function, then during the
                    350: execution of the called function, all timer interrupts will be charged
                    351: to the caller (since \fBcurrent\fP still points to the caller's
                    352: variable).  This is often desirable, as described above.  But if an
                    353: unprofiled function calls a profiled function, then upon return to the
                    354: unprofiled function the \fBcurrent\fP pointer won't be reset, and
                    355: interrupts will continue to be charged to the called function after it
                    356: has returned.  This is undesirable, and should be prevented by the
                    357: compiler.  In a language with first-class functions, it is difficult
                    358: to prevent profiled functions from being passed as arguments to
                    359: unprofiled functions that might then call them.  In practice, this has
                    360: not proved to be a problem, probably because unprofiled functionals
                    361: are typically simple primitives like \fIapp\fP and \fImap\fP, which do
                    362: little intrinsic computation.
                    363: .NH
                    364: Overhead measurements
                    365: .LP
                    366: We ran the same program several times with various of our profiling
                    367: features enabled; this gives a reasonably accurate measurement of profiling
                    368: overhead:
                    369: .KF
                    370: .TS
                    371: tab(|) box center;
                    372: l c c
                    373: l n n.
                     374: |Time (sec)|%Overhead
                     375: _
                     376: User code|2801|
                     377: _
                     378: Call counts|568|20.3
                     379: Setting current function|286|10.2
                     380: Interrupts|47|1.7
                     381: _
                     382: Total Overhead|901|32.2
                    383: .TE
                    384: .KE
                    385: The total overhead of 32% is not prohibitively expensive.  Our code generator
                    386: takes three instructions to increment a call-count (fetch, add, store);
                    387: a better instruction-selector could probably reduce this overhead to 8%, and the
                    388: total overhead to 20%.
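
The percentages can be re-derived from the measured times with a few lines of Standard ML. The times are the ones reported above; the 8% figure is the estimate given in the text, not a measurement.

```sml
val userSecs = 2801.0

(* Measured overhead components, in seconds. *)
val overheads = [("Call counts", 568.0),
                 ("Setting current function", 286.0),
                 ("Interrupts", 47.0)]

(* Overhead as a percentage of user-code time. *)
fun pct secs = 100.0 * secs / userSecs

val totalSecs = foldl (fn ((_, s), acc) => acc + s) 0.0 overheads
val totalPct  = pct totalSecs

(* Projected total if call counting cost 8% instead of its measured share. *)
val projectedPct = totalPct - pct 568.0 + 8.0
```

The total comes to roughly 32.2% and the projection to just under 20%, in agreement with the figures quoted.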
                    389: 
                    390: .KF
                    391: There is also an implementation overhead; it turned out to be fairly simple
                    392: to get this profiler running.
                    393: .TS
                    394: tab(|) center;
                    395: l n.
                    396: Insertion of profiling instructions|49 lines
                    397: Interrupt handling|32
                    398: Global database|16
                    399: Report generation|72
                    400: _
                    401: Total|169 lines
                    402: .TE
                    403: In contrast, this paper is about 500 lines long.
                    404: .KE
                    405: .NH
                    406: Conclusion
                    407: .LP
                    408: Traditional approaches to profiling run into problems when we attempt to apply
                    409: them to functional languages where code may be moved around by garbage collection,
                    410: and the task is further complicated when an optimizing compiler freely rearranges
                    411: the code.  The basic difficulty is that the mapping between the current pc and the
                    412: currently executing function is difficult to maintain.  
                    413: 
                    414: We have found a simple way around this difficulty, which consists of
                    415: maintaining a global variable that always points to the interrupt
                    416: count for the current function, and which is to be charged whenever
                    417: there is a timer interrupt.  Because we manipulate this variable in the
                    418: intermediate representation of the compiler, our method is very easy
                    419: to implement and has no nasty interactions with code generation or
                    420: garbage collection algorithms (which already preserve semantics of
                    421: intermediate-representation operations).
                    422: 
                    423: This method has acceptable overhead and accuracy.  Furthermore, by
                    424: judiciously mixing profiled and unprofiled functions, one can extract
                    425: information on inherited costs as well as the direct costs of calling
                    426: particular functions.  This information is similar to that provided by
                     427: sophisticated profilers like \fBgprof\fP, but is more accurate.
                    428: .SH
                    429: References
                    430: .LP
                    431: .IP [Appel87] 1i
                    432: Appel, Andrew W. and MacQueen, David B.  ``A Standard ML compiler,''
                    433: in \fIFunctional Programming Languages and Computer Architecture\fP,
                     434: LNCS 274, G. Kahn, ed., pp. 301-324, 1987.
                    435: .IP [Graham82]
                    436: Graham, Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick.
                    437: ``gprof: a call graph execution profiler''" in 
                    438: \fIProc. SIGPLAN '82 Symp. on Compiler Construction, SIGPLAN Notices\fP
                    439: 17(4), pp. 120-126, 1982.
                    440: .IP [Johnston70]
                    441: Johnston, T. Y., and Johnson, R. H., \fIProgram Performance
                    442: Measurement\fP, SLAC User Note 33, Rev. 1, Stanford University, California, 1970.
                    443: .IP [Unix]
                    444: Unix Programmer's Manual, ``prof command,'' section 1, Bell Laboratories, Murray Hill,
                    445: NJ, 1979.