researchv10dc/cmd/sml/doc/papers/profiling/paper.ms - annotate

Return to paper.ms CVS log
Up to [Research Unix] / researchv10dc / cmd / sml / doc / papers / profiling
Annotation of researchv10dc/cmd/sml/doc/papers/profiling/paper.ms, revision 1.1

1.1     ! root        1: .LP
        !             2: -
        !             3: .ft 3
        !             4: .ce 99
        !             5: .sp 2i
        !             6: .LG
        !             7: Profiling in the Presence of Optimization and Garbage Collection
        !             8: .sp
        !             9: .ft 2
        !            10: .ce 99
        !            11: .NL
        !            12: Andrew W. Appel*
        !            13: Bruce F. Duba\(dg
        !            14: David B. MacQueen\(dg
        !            15: .sp 0.6i
        !            16: .ce 99
        !            17: .NL
        !            18: November 1988
        !            19: .sp 1i
        !            20: .ce
        !            21: .ft  2
        !            22: ABSTRACT
        !            23: .ft 1
        !            24: .IP
        !            25: Profiling the execution of programs can be a great help in tuning their
        !            26: performance, and programs written in functional languages are no exception.
        !            27: The standard techniques of call-counting and statistical (interrupt-driven)
        !            28: execution time measurement work well, but with some modification.  In
        !            29: particular, the program counter is not the best indicator of ``current
        !            30: function.''
        !            31: Our profiler inserts explicit increment and assignment statements into
        !            32: the intermediate representation, and is therefore very simple to implement
        !            33: and completely independent of the code-generator.
        !            34: .LP
        !            35: .sp 1i
        !            36: .nr PS  8
        !            37: .nr VS 10
        !            38: .LP
        !            39: * Supported in part by NSF Grant CCR-8806121 and by a Digital Equipment Corp.
        !            40: Faculty Incentive Grant.
        !            41: .LP
        !            42: \(dg AT&T Bell Laboratories, Murray Hill, NJ.
        !            43: .nr PS 10
        !            44: .nr VS 16
        !            45: .LP
        !            46: .bp
        !            47: .NH
        !            48: Execution profiling
        !            49: .LP
        !            50: A large program usually consists of many small functions.  
        !            51: When such a program is to be tuned for efficiency, it is necessary to
        !            52: identify which of those functions are taking the bulk of the execution time.
        !            53: Then the commonly-used functions can be made more efficient, or called less
        !            54: often, or both.  By using a theoretical analysis of the algorithms used
        !            55: in a program, such functions can be identified; but a complete
        !            56: theoretical analysis is complex and
        !            57: impractical for large programs.
        !            58: 
        !            59: An execution profiler provides an empirical measurement of the time spent
        !            60: in each function.  A widely-used Unix tool, \fBprof\fP [Unix],
        !            61: provides a count of how many times each function is called, and how many
        !            62: seconds were spent in each function.   This information is very useful
        !            63: in identifying which functions are in need of improvement, after which
        !            64: a theoretical analysis of just those functions might be carried out,
        !            65: a much less forbidding endeavor than analyzing the whole program.
        !            66: 
        !            67: \fBProf\fP gathers
        !            68: call-count information 
        !            69: by having the compiler insert at the beginning of each function
        !            70: an instruction that increments a call-count variable associated with
        !            71: the function.  An approximation to the amount
        !            72: of total time spent in the function is gathered by the use of a timer
        !            73: interrupt:  every 1/60th of a second, the operating system notes
        !            74: in a ``histogram'' array the value of the program counter.
        !            75: (This is an ancient technique [Johnston70].)
        !            76: Then, at the end of program execution, \fBprof\fP estimates
        !            77: the amount of time spent in each
        !            78: function by summing the values in the histogram array
        !            79: corresponding to program counter samples between the beginning and end of the
        !            80: machine code for that function.
        !            81: The interrupt-driven sampling method has a much lower overhead than
        !            82: querying a clock on each entry to and exit from a function.
        !            83: 
        !            84: A more elaborate profiling tool, \fBgprof\fP [Graham82],
        !            85: provides even more information.  
        !            86: When one primitive function (e.g. a
        !            87: table-lookup routine) is used in many places, it is useful to know,
        !            88: not only the total time for the execution of the primitive, but also, how
        !            89: much time to ``charge'' the calling functions. With \fBgprof\fP this is
        !            90: approximated by keeping a count of the number of
        !            91: times each call-site is used, and on that basis apportioning the
        !            92: average execution time
        !            93: of the called function to the functions that call it.
        !            94: 
        !            95: These approaches to execution-time measurement and apportionment pose certain
        !            96: problems for optimizing compilers and for functional languages:
        !            97: .IP 1.
        !            98: The machine-code for a function is not necessarily all contiguous.
        !            99: A function may be turned into several pieces of code, with portions of
        !           100: the code for other functions interspersed.  This problem could certainly
        !           101: be solved by elaborate bookkeeping in the optimizer and code generator,
        !           102: but we wanted to avoid that complexity.
        !           103: .IP 2.
        !           104: An optimizer can expand functions in-line in other functions.
        !           105: The program-counter method will charge the calling function
        !           106: instead of the called function,
        !           107: even though it might be desirable for in-line expansion to be made
        !           108: semantically invisible.
        !           109: .IP 3.
        !           110: The histogram array in \fBprof\fP
        !           111: must be proportional in size to the address range spanned by pieces of
        !           112: code for executable functions.  Our runtime
        !           113: system intersperses code and data
        !           114: throughout memory; even worse, it periodically garbage collects, moving
        !           115: code and data from place to place.  This problem could have been solved
        !           116: by elaborate bookkeeping in the runtime system, which we also wanted to avoid.
        !           117: .LP
        !           118: We had to deal with these problems in the course of implementing a
        !           119: profiler for an optimizing compiler for the functional language
        !           120: Standard ML [Appel87].  The approach we used is described in the next
        !           121: section.
        !           122: .NH
        !           123: Intermediate Representation of call-counting and current-function
        !           124: .LP
        !           125: For execution-time estimation we use a timer interrupt, as does
        !           126: \fBprof\fP, to increment a histogram entry.  However, we don't use the
        !           127: program counter to calculate which histogram entry to increment.
        !           128: Instead, we maintain a ``pointer-to-current-function-entry'' in a
        !           129: global variable called \fBcurrent\fP that is accessible to the
        !           130: timer-interrupt handler.  Each function has associated with it two
        !           131: auxiliary variables: a call-count and an interrupt-count.  On entry to
        !           132: a function, it increments the call-count and assigns the address of
        !           133: the interrupt-count variable into the global \fBcurrent\fP variable.
        !           134: Then, when a timer interrupt occurs, the interrupt handler just increments the
        !           135: variable that \fBcurrent\fP points to.
        !           136: 
        !           137: When a function returns \(em either normally or via an exception \(em
        !           138: \fBcurrent\fP must be set back to the interrupt-count variable of the
        !           139: function that it is returning to.  This resetting could be done either
        !           140: by the calling function (after the called function has returned), or by
        !           141: the called function before exit.  For several reasons, it is better done by
        !           142: the calling function.  If the called function does the reset,  a stack
        !           143: of current-function pointers is required; this is expensive to maintain.
        !           144: A stack of current-pointers would also greatly complicate the treatment
        !           145: of exception-handlers; with the caller-reset method, the exception-handler
        !           146: justs sets the \fBcurrent\fP to point at the appropriate counter variable
        !           147: on entry.
        !           148: On recursive calls, \fBcurrent\fP need not be reset (as the calling
        !           149: and called function are the same), but only the calling function knows
        !           150: which calls are recursive.  And finally, tail-calls can be optimized
        !           151: if the caller resets \fBcurrent\fP.
        !           152: 
        !           153: A tail-call is one that is not followed by any
        !           154: executable code before the function returns.  After a tail-call, the
        !           155: function will immediately return and therefore \fBcurrent\fP will
        !           156: immediately be reset.  Therefore, it is not necessary for the calling
        !           157: function to reset \fBcurrent\fP after a tail-call.
        !           158: This is a useful optimization, and it is particularly important
        !           159: when used with a compiler that optimizes tail-calls into jumps; if the
        !           160: current pointer had to be reset after the tail-call, it would no
        !           161: longer be a tail-call and performance would suffer dramatically.
        !           162: Fortunately, it is easy to identify tail-calls statically as the
        !           163: profiling instructions are being inserted.
        !           164: 
        !           165: We insert the profiling instructions as ordinary assignment statements
        !           166: in the intermediate representation.  In almost any compiler's
        !           167: intermediate representation it is easy to represent the operations
        !           168: of fetching, adding one, and storing, for the call-count increment operation;
        !           169: and storing, for the assignment to the \fBcurrent\fP variable.
        !           170: 
        !           171: Functional programming languages introduce another problem for the
        !           172: design of profilers: what to do with anonymous, first-class functions.
        !           173: The simplest choice is to do nothing; collect no call-counts and let
        !           174: the time be charged to the caller of the unnamed function. The main
        !           175: disadvantage of this solution, besides not having the call-counts, is
        !           176: that there is no convenient way to find the code that contributes to
        !           177: the cost of a profiled function that calls anonymous functions.
        !           178: 
        !           179: Probably the most general solution is to make up names for the unnamed
        !           180: functions (for example, an unnamed function statically enclosed in function
        !           181: \fIf\fP might be called \fIf.anon\fP).
        !           182: If anonymous functions are given names they can be treated
        !           183: just as any other function; call counts and execution time will be
        !           184: reported.  Of course, the user will need to associate the new names
        !           185: with the correct function, but in practice this is rarely a problem.
        !           186: .NH
        !           187: An example
        !           188: .LP
        !           189: To illustrate the technique, we present a simple example (figure 1).  The ML function
        !           190: \f(CWsubset\fP takes a predicate function as an argument, and returns
        !           191: a function that maps lists to lists; the output list will be that sublist
        !           192: of the input list containing just those elements that satisfy the predicate.  
        !           193: The user's program is displayed in typewriter font; the compiler puts some
        !           194: scaffolding around it (indicated in italics) to make a record
        !           195: containing all the functions declared by the user.
        !           196: .KF
        !           197: .DS
        !           198: .ft CW
        !           199: \fIlet\fP fun subset pred =
        !           200:        let fun f nil = nil
        !           201:              | f (a::r) = if pred a then a::f(r) else f(r)
        !           202:         in f
        !           203:        end
        !           204: 
        !           205:    fun isPrime x = 
        !           206:        let fun test i = i>=x orelse (x mod i <> 0 andalso test(i+1))
        !           207:         in test 2
        !           208:        end
        !           209: 
        !           210:    val primes = subset isPrime
        !           211: \fI in (subset, isPrime, primes)
        !           212: end\fP
        !           213: .ft R
        !           214: .DE
        !           215: .DS C
        !           216: Figure 1.
        !           217: .DE
        !           218: .KE
        !           219: If this code is compiled with profiling enabled, the compiler inserts
        !           220: the call-counting and current-function instructions into the intermediate
        !           221: representation.  Here, we display the effects as if written in the source language
        !           222: (figure 2).
        !           223: .KF
        !           224: .DS
        !           225: .ft CW
        !           226: \fIlet val subset.CC = ref 0 and subset.IC = ref 0
        !           227:     and subset.f.CC = ref 0 and subset.f.IC = ref 0
        !           228:     and isPrime.CC = ref 0 and isPrime.IC = ref 0
        !           229:     and isPrime.test.CC = ref 0 and isPrime.test.IC = ref 0
        !           230: \fP
        !           231:    fun subset pred =
        !           232:        \fI(subset.CC := !subset.CC + 1;\fP
        !           233:        \fIcurrent := subset.IC;\fP
        !           234:        let fun f x =
        !           235:            \fI(subset.f.CC := !subset.f.CC + 1;\fP
        !           236:            \fIcurrent := subset.f.IC;\fP
        !           237:            case x of
        !           238:              nil => nil
        !           239:            | a::r => let val pa = pred a
        !           240:                       in \fIcurrent := subset.f.IC;\fP
        !           241:                          if pa then a :: f(r) else f(r)
        !           242:                      end
        !           243:         in f
        !           244:        end
        !           245: 
        !           246:    fun isPrime x = \fI(isPrime.CC := !isPrime.CC + 1;\fP
        !           247:                    \fIcurrent := isPrime.IC;\fP
        !           248:                    . . . )
        !           249: 
        !           250:    val primes = subset isPrime
        !           251: \fI
        !           252:  in ((subset, isPrime, primes),
        !           253:      ((subset.CC, subset.IC, "subset"),
        !           254:       (subset.f.CC, subset.f.IC, "subset.f"),
        !           255:       (isPrime.test.CC, isPrime.test.IC, "isPrime.test"),
        !           256:       (isPrime.CC, isPrime.IC, "isPrime")))
        !           257: end\fP
        !           258: .ft R
        !           259: .DE
        !           260: .DS C
        !           261: Figure 2.
        !           262: .DE
        !           263: .KE
        !           264: 
        !           265: For each function, two variables are introduced:  a call-count and an interrupt-count.
        !           266: On entry to a function, the call-count is incremented, and the global variable
        !           267: \fBcurrent\fP is set to point to the interrupt-count.  On re-entry to a function
        !           268: after a subroutine call, \fBcurrent\fP is reset to the function's
        !           269: interrupt-count variable.  However, this is not necessary after recursive calls
        !           270: and tail calls, e.g. the calls to \f(CWf\fP.
        !           271: 
        !           272: The initial \fIlet\fP-bindings create all the count variables, and the
        !           273: last four lines produce, instead of just a record containing the user's
        !           274: declared objects, a pair of records:  the user's declared objects, and a
        !           275: list of records containing profiling variables, each with an identifying
        !           276: string constant.  These string constants will be embedded in the executable
        !           277: code for this module, and will enable the call-count variables to be self-identifying.
        !           278: Our runtime system maintains a global list of these
        !           279: 3-element records; when it is time to print an execution profile, they are
        !           280: sorted in decreasing order of interrupt-count.
        !           281: 
        !           282: .KS
        !           283: Our output looks like the output of \fBprof\fP:
        !           284: .DS
        !           285: .ft CW
        !           286: %time  cumsecs   #call ms/call  name
        !           287:  90.4     3.52   78189    .045  isPrime.test
        !           288:   8.4     3.85    1000    .330  isPrime
        !           289:    .7     3.88       0          (unprofiled)
        !           290:    .2     3.89    1001    .009  subset.f
        !           291:    .0     3.89    1001    .000  natlist
        !           292:    .0     3.89       1    .000  subset
        !           293: .ft R
        !           294: .DE
        !           295: .KE
        !           296: 
        !           297: Now, armed with this information, a programmer
        !           298: might decide that it is worthwhile
        !           299: re-writing the \f(CWisPrime\fP function
        !           300: to make it as efficient as possible.  But at a certain point
        !           301: the programmer will want to know what functions are calling \f(CWisPrime\fP
        !           302: so he can make them call it less often.  By re-compiling with
        !           303: \f(CWisPrime\fP unprofiled, any time spent in \f(CWisPrime\fP will
        !           304: now be charged to the function that called it.  This is because
        !           305: \f(CWisPrime\fP will not change the \fBcurrent\fP variable, so that the
        !           306: timer-interrupt will increment the  count for the function that last
        !           307: set \fBcurrent\fP \(em and this will be the one that called \fPisPrime\fP.
        !           308: The profiling system won't do this automatically, but by comparing two
        !           309: different execution profiles, one with \f(CWisPrime\fP compiled with
        !           310: profile instructions and one with \f(CWisPrime\fP unprofiled, an accurate estimate
        !           311: can be made of who is calling it.
        !           312: .NH 
        !           313: Advantages of our current-function method
        !           314: .LP
        !           315: Since we use ordinary intermediate-representation operators
        !           316: for profiling,
        !           317: the optimizer and code-generator ``believe'' that 
        !           318: profiling operations are part of the program.
        !           319: Since an optimizer must not modify the semantics of the program,
        !           320: the semantics of profiling will not be modified either.
        !           321: Therefore, if one function is copied and inserted in-line into another,
        !           322: the call-count and current-function instructions will be copied and
        !           323: inserted at the right place.  Other optimizations that break functions
        !           324: into several disjoint pieces of code will leave the profiling
        !           325: instructions in the appropriate places.
        !           326: 
        !           327: Furthermore, the result is that the implementation of the profiler
        !           328: is completely independent of the code generator.  We have four different
        !           329: code generators for our compiler (two different algorithms each for the
        !           330: Vax and the Motorola 68020), and not a line of any of them was modified
        !           331: for the installation of the profiler.
        !           332: 
        !           333: By compiling some functions unprofiled, as described in the previous section,
        !           334: we can find out what callers are responsible for most of their execution time.
        !           335: This kind of trick serves much the same purpose that the more elaborate
        !           336: program \fBgprof\fP does; and it's a trick that wouldn't work with
        !           337: a program-counter histogram.
        !           338: Furthermore, our method is more accurate than \fBgprof\fP.  Suppose
        !           339: functions \fIf\fP and \fIg\fP both call a function \fIisPrime\fP, but
        !           340: \fIf\fP consistently makes expensive calls (that take a long time) while
        !           341: \fIg\fP makes cheap ones.  \fBGprof\fP allocates the total time spent in
        !           342: \fIisPrime\fP on the basis of call counts from \fIf\fP and \fIg\fP;
        !           343: this will miss the fact that \fIf\fP is responsible for most of the cost.
        !           344: In this example, when profiling for \fIisPrime\fP is turned off, \fIf\fP
        !           345: and \fIg\fP will be charged for the actual time spent in \fIisPrime\fP
        !           346: on their behalf.  (On the other hand, \fBGprof\fP will give an accurate
        !           347: breakdown of call-site counts that our method does not provide.)
        !           348: 
        !           349: If a profiled function calls an unprofiled function, then during the
        !           350: execution of the called function, all timer interrupts will be charged
        !           351: to the caller (since \fBcurrent\fP still points to the caller's
        !           352: variable).  This is often desirable, as described above.  But if an
        !           353: unprofiled function calls a profiled function, then upon return to the
        !           354: unprofiled function the \fBcurrent\fP pointer won't be reset, and
        !           355: interrupts will continue to be charged to the called function after it
        !           356: has returned.  This is undesirable, and should be prevented by the
        !           357: compiler.  In a language with first-class functions, it is difficult
        !           358: to prevent profiled functions from being passed as arguments to
        !           359: unprofiled functions that might then call them.  In practice, this has
        !           360: not proved to be a problem, probably because unprofiled functionals
        !           361: are typically simple primitives like \fIapp\fP and \fImap\fP, which do
        !           362: little intrinsic computation.
        !           363: .NH
        !           364: Overhead measurements
        !           365: .LP
        !           366: We ran the same program several times with various of our profiling
        !           367: features enabled; this gives a reasonably accurate measurement of profiling
        !           368: overhead:
        !           369: .KF
        !           370: .TS
        !           371: tab(|) box center;
        !           372: l c c
        !           373: l n n.
        !           374: |Time|%Overhead|LenL|LenR|Comp
        !           375: _
        !           376: User code|2801 sec|
        !           377: _
        !           378: Call counts|568|20.3%
        !           379: Setting current function|286|10.2
        !           380: Interrupts|47|1.7
        !           381: _
        !           382: Total Overhead|901|32.2%
        !           383: .TE
        !           384: .KE
        !           385: The total overhead of 32% is not prohibitively expensive.  Our code generator
        !           386: takes three instructions to increment a call-count (fetch, add, store);
        !           387: a better instruction-selector could probably reduce this overhead to 8%, and the
        !           388: total overhead to 20%.
        !           389: 
        !           390: .KF
        !           391: There is also an implementation overhead; it turned out to be fairly simple
        !           392: to get this profiler running.
        !           393: .TS
        !           394: tab(|) center;
        !           395: l n.
        !           396: Insertion of profiling instructions|49 lines
        !           397: Interrupt handling|32
        !           398: Global database|16
        !           399: Report generation|72
        !           400: _
        !           401: Total|169 lines
        !           402: .TE
        !           403: In contrast, this paper is about 500 lines long.
        !           404: .KE
        !           405: .NH
        !           406: Conclusion
        !           407: .LP
        !           408: Traditional approaches to profiling run into problems when we attempt to apply
        !           409: them to functional languages where code may be moved around by garbage collection,
        !           410: and the task is further complicated when an optimizing compiler freely rearranges
        !           411: the code.  The basic difficulty is that the mapping between the current pc and the
        !           412: currently executing function is difficult to maintain.  
        !           413: 
        !           414: We have found a simple way around this difficulty, which consists of
        !           415: maintaining a global variable that always points to the interrupt
        !           416: count for the current function, and which is to be charged whenever
        !           417: there is a timer interrupt.  Because we manipulate this variable in the
        !           418: intermediate representation of the compiler, our method is very easy
        !           419: to implement and has no nasty interactions with code generation or
        !           420: garbage collection algorithms (which already preserve semantics of
        !           421: intermediate-representation operations).
        !           422: 
        !           423: This method has acceptable overhead and accuracy.  Furthermore, by
        !           424: judiciously mixing profiled and unprofiled functions, one can extract
        !           425: information on inherited costs as well as the direct costs of calling
        !           426: particular functions.  This information is similar to that provided by
        !           427: sophisticated profilers like gprof, but is more accurate.
        !           428: .SH
        !           429: References
        !           430: .LP
        !           431: .IP [Appel87] 1i
        !           432: Appel, Andrew W. and MacQueen, David B.  ``A Standard ML compiler,''
        !           433: in \fIFunctional Programming Languages and Computer Architecture\fP,
        !           434: LNCS 274, G. Kahn, ed., pp 301-324, 1987
        !           435: .IP [Graham82]
        !           436: Graham, Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick.
        !           437: ``gprof: a call graph execution profiler''" in 
        !           438: \fIProc. SIGPLAN '82 Symp. on Compiler Construction, SIGPLAN Notices\fP
        !           439: 17(4), pp. 120-126, 1982.
        !           440: .IP [Johnston70]
        !           441: Johnston, T. Y., and Johnson, R. H., \fIProgram Performance
        !           442: Measurement\fP, SLAC User Note 33, Rev. 1, Stanford University, California, 1970.
        !           443: .IP [Unix]
        !           444: Unix Programmer's Manual, ``prof command,'' section 1, Bell Laboratories, Murray Hill,
        !           445: NJ, 1979.
unix.superglobalmegacorp.com
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.