researchv10no/cmd/sml/doc/papers/runtime/run.tex - annotate

Return to run.tex CVS log
Up to [Research Unix] / researchv10no / cmd / sml / doc / papers / runtime
Annotation of researchv10no/cmd/sml/doc/papers/runtime/run.tex, revision 1.1.1.1

1.1       root        1: \documentstyle[11pt]{article}
                      2: \bibliographystyle{plain}  %changed from alpha
                      3: \documentstyle{openbib}
                      4: \makeindex
                      5:     \oddsidemargin   .5in     %   Left margin on odd-numbered pages.
                      6:     \evensidemargin  .5in     %   Left margin on even-numbered pages.
                      7:    \topmargin       .5in    %    Nominal distance from top of page to top of
                      8:    \headheight        0pt    %    Height of box containing running head.
                      9:    \headsep           0pt    %    Space between running head and text.
                     10: % Bottom of page:
                     11:    \footheight       12pt    %    Height of box containing running foot.
                     12:    \footskip         30pt    %    Distance from baseline of box containing 
                     13:                              %    foot to baseline of last line of text.
                     14:    \textwidth         5.5in  %    width of text on page
                     15: 
                     16: %\newcommand{\mysection}[1]{\section{\protect\large\bf #1}}
                     17: %\setcounter{secnumdepth}{0}
                     18: 
                     19: \sloppy
                     20: 
                     21: \begin{document}
                     22: \title{A Runtime System \\ (DRAFT)}
                     23: \author{Andrew W. Appel\thanks{
                     24: Supported in part by NSF Grants
                     25: DCR-8603453 and CCR-880-6121}
                     26:         \\ Princeton University}
                     27: \date{February 1989}
                     28: \maketitle
                     29: \begin{abstract} 
                     30: The runtime data structures of the Standard ML of New Jersey compiler
                     31: are simple yet general.  As a result, code generators are easy to implement,
                     32: programs execute quickly, garbage collectors are easy to implement and
                     33: work efficiently, and a variety of runtime facilities can be provided with
                     34: ease.
                     35: \end{abstract}
                     36: 
                     37: \section{Introduction}
                     38: \label{introduction}
                     39: 
                     40: Some languages, like Lisp, Smalltalk, ML, Prolog, etc. rely heavily on a
                     41: {\em runtime system} to provide essential services.  The most important
                     42: such service is the management of dynamically-allocated storage
                     43: (e.g. {\bf garbage collection}), but the runtime system may also provide
                     44: facilities like
                     45: 
                     46: \begin{itemize}
                     47: \item {\bf Stream input/output:} on operating systems (like Unix) that do not
                     48: have a buffered input/output facility, the process must provide its own;
                     49: this might be handled in the runtime system.
                     50: \item {\bf Structured input/output:} the ability to automatically
                     51: write a large linked data structure to a file, and read it back in
                     52: with all links adjusted, is a great convenience that can be implemented
                     53: efficiently in the runtime system.
                     54: \item {\bf Process suspension:} a snapshot of an
                     55: executing process may be ``preserved'' in a file, so that
                     56: the execution of that file causes a new process to start exactly
                     57: where the saved one left off.
                     58: \item {\bf Operating system calls:} operating system services needed by
                     59: a program may be conveniently packaged by the runtime system.
                     60: \item {\bf Handling of interrupts and asynchronous events:} if the programming
                     61: language has a mechanism to handle asynchronous events, it relies on the
                     62: runtime system for its implementation.
                     63: \item {\bf Handling of arithmetic exceptions:} the programming language's
                     64: exception-handling mechanism must be implemented in cooperation with
                     65: the runtime system.
                     66: \item {\bf Assembly-language implementation of language primitives:}
                     67: it may be inconvenient for the compiler to generate code for some features
                     68: of the programming language; these functions can be implemented as calls
                     69: to runtime system routines.
                     70: \item {\bf Foreign-language procedure calls:} calls to subroutines written
                     71: in other languages may be mediated by the runtime system.
                     72: \item {\bf Fun with continuations:} Languages with features like {\em call with
                     73: current continuation}, which allows the explicit manipulation of threads
                     74: of control, require runtime-system cooperation.
                     75: \item {\bf Execution profiling:} Automatic measurement of the time spent in
                     76: different parts of the user program can be accomplished with the help of
                     77: the runtime system.
                     78: \item {\bf Debugging:} starting, stopping, and displaying the execution state
                     79: of user programs can be accomplished only by low-level routines.
                     80: \end{itemize}
                     81: 
                     82: Since this list is rather long, and several of these features may interact,
                     83: it is evident that runtime systems can become nasty and complicated.
                     84: The proliferation of data types may make the implementation of 
                     85: the garbage collector (and other programs that must traverse the data)
                     86: inefficient.
                     87: 
                     88: We particularly wanted a runtime data layout that provided fast
                     89: allocation of records and fast garbage collection, since ML makes very
                     90: heavy use of dynamically-allocated storage.  To this end, we eliminated
                     91: the runtime stack and use a very simple data format.
                     92: The only runtime data types are {\em integers}, {\em pointers},
                     93: {\em records}, and {\em strings}.
                     94: Then we found that this very simple structure allowed the easy implementation
                     95: of many of the services described above.
                     96: 
                     97: Our runtime system has relatively few ML-specific
                     98: features; it could be used equally well for other languages.  This paper
                     99: describes the design and implementation of the SML-NJ runtime system.
                    100: It should be read in conjunction with an earlier paper on the
                    101: SML-NJ generational garbage collector\cite{appel89:sggc}.
                    102: 
                    103: \begin{quotation}{
                    104: \small Text in smaller type is of interest only to those who might
                    105: have to read, modify, or maintain the source code to the runtime system.
                    106: }\end{quotation}
                    107: 
                    108: \section{Standard ML of New Jersey}
                    109: \label{standard}
                    110: The ML language originated as part of the Edinburgh LCF proof system
                    111: \cite{gordon78}, and was soon implemented as a stand-alone compiler
                    112: \cite{cardelli83:ml}\cite{cardelli84}.  The language was
                    113: ``standardized'' \cite{milner84}\cite{macqueen84}
                    114: \cite{milner85}\cite{harper88}, and
                    115: Standard ML has been implemented at Edinburgh, Cambridge, and 
                    116: New Jersey\cite{appel87:sml}; the New Jersey implementation is
                    117: a joint effort between researchers at AT\&T Bell Laboratories and
                    118: Princeton University.
                    119: 
                    120: Though ML was first implemented as the meta-language of a theorem-proving
                    121: system, Standard ML is a general-purpose 
                    122: programming language with several advantages over more conventional
                    123: languages.  Its important characteristics are:
                    124: 
                    125: \begin{itemize}
                    126: \item {\bf Automatic garbage collection:} this is a great convenience in
                    127: writing correct and readable programs.
                    128: \item {\bf Static, polymorphic types:}  like Pascal, types are checked
                    129: at compile-time and not at runtime; but like Lisp, there is great flexibility
                    130: and re-usability of code.
                    131: \item {\bf Safety:} there are no runtime insecurities (i.e. ``core is never
                    132: dumped''); this is unlike the C language, where unsafe pointers run
                    133: rampant, and like Lisp (except when Lisp programmers turn off the runtime
                    134: type checking for efficiency).  In ML, there is no run-time type checking,
                    135: but safety is guarateed by compile-time type checking.
                    136: \item {\bf Higher-order functions:} like Scheme (and lambda-calculus); this
                    137: can lead to a desirable conciseness of expression.
                    138: \item {\bf Typechecked Modules:} like Ada and Modula.
                    139: \item {\bf Exception handling:} a dynamically-scoped exception
                    140: mechanism allows both hardware- and software-generated exceptions
                    141: to be caught and handled by user programs.
                    142: \end{itemize}
                    143: 
                    144: One important feature of a systems programming language
                    145: is not prescribed by the list above, and in fact cannot be specified
                    146: by a language definition: an efficient and robust implementation.
                    147: Standard ML of New Jersey\cite{appel87:sml} is meant to be a complete,
                    148: efficient, robust, and cleanly written compiler for the language.
                    149: It has an optimizing code generator based on continuation-passing style
                    150: \cite{appel89:cps}.
                    151: 
                    152: A knowledge of ML is not necessary to understand its runtime system.
                    153: 
                    154: \section{Tagging schemes}
                    155: \label{tagging}
                    156: 
                    157: Almost all the pieces of the runtime system must deal with the data structures
                    158: of the executing program.  Therefore, it is helpful to keep the format of
                    159: this data as simple and straightforward as possible.
                    160: 
                    161: ML, like Lisp, allows polymorphic functions: a function that reverses a list
                    162: of objects (for example) need not know the type of the objects.  In order that
                    163: the same piece of executable code can operate on objects of arbitrary type,
                    164: it is necessary that every object be represented in the same amount of space.
                    165: As in Lisp, this is achieved by making everything be the size of a pointer;
                    166: if an object's natural representation is larger (as for a record of $n$
                    167: objects), then it is represented by a pointer to storage on the heap.
                    168: 
                    169: The garbage collector must be able to determine the size and layout of each
                    170: object it traverses.  This can be accomplished in several ways:
                    171: \begin{itemize}
                    172: \item {\bf  By encoding a type-tag\index{tag}
                    173: inside each pointer;} we chose not to do this
                    174: because it reduces the number of bits left for actual addressing.
                    175: In these days of 32-bit pointers and 100-megabyte memories, it is easy
                    176: to imagine the need for every bit of addressability we can muster.
                    177: \item {\bf By reserving different areas of memory for objects of different
                    178: types.}
                    179: BIBOP\index{BIBOP} (Big Bag Of Pages) schemes require that each page hold objects of a single
                    180: type, and a global table maps pages to types.  This is relatively efficient,
                    181: and doesn't reduce the number of addressable words.  But it complicates the
                    182: process of allocating and copying objects, since several free regions
                    183: are simultaneously required.  
                    184: \item {\bf Statically-typed languages don't require any runtime-tagging} at all;
                    185: instead, the compiler can tell the garbage collectector about the type
                    186: system of the program.  This has worked well in Pascal\cite{britton75}.
                    187: Even though it is also theoretically possible in ML, in practice the
                    188: polymorphic type system introduces overhead and complexity that make
                    189: this method unattractive\cite{appel89:tag}.
                    190: \item {\bf Each record can have a descriptor at the beginning} that explains
                    191: which fields are integers and which are pointers.  This method works
                    192: well in non-polymorphic languages like Modula-2+ and Mesa, where
                    193: descriptors can be computed at compile-time and just inserted
                    194: in records as they are created.  However,
                    195: in polymorphic languages like Lisp and ML
                    196: this descriptor would have to be laboriously
                    197: constructed each time a record is created, introducing unacceptable overhead.
                    198: \item {\bf Each record can have a descriptor\index{descriptor} at the
                    199: beginning} that tells
                    200: the length of the record, and {\bf each field can have a tag bit} that tells
                    201: whether it is a pointer or a non-pointer.
                    202: This is what we have done.
                    203: \end{itemize}
                    204: 
                    205: To have a tag word on each record, and a tag bit on each
                    206: field, is not clever at all; but we are not always
                    207: striving for cleverness, we 
                    208: want simplicity and efficiency.  If all our records were {\em cons} cells,
                    209: then one-third of memory would be devoted to tag words; but memories
                    210: are large and cheap.  And in Standard ML, records are of many different
                    211: sizes, and two-word records are not particularly predominant.
                    212: 
                    213: By having only one free region, we are
                    214: able to do fast allocation\index{allocation}
                    215: and also manage the total amount of virtual
                    216: memory used\cite{appel89:sggc}; this would be much more difficult
                    217: with a BIBOP scheme.
                    218: 
                    219: \section{Data formats in our runtime system}
                    220: \label{data}
                    221: Standard ML of New Jersey has
                    222: records, strings, procedures (machine code),
                    223: closures, constructors, arrays, byte-arrays, floating-point
                    224: numbers, references (modifiable one-word records),
                    225: modules, and integers.  This large set of language primitives and
                    226: user-defined datatypes are represented
                    227: by just two runtime data formats: one for objects that contain pointers
                    228: (records, closures, constructors, arrays, references, modules), and
                    229: the other for objects containing no pointers (strings, procedures,
                    230: byte-arrays, floating point).  The garbage collector (and other
                    231: parts of the runtime system) need to understand only these two formats,
                    232: not the many kinds of ML objects.
                    233: 
                    234: A {\em field\index{field}} is either a 
                    235: {\em pointer\index{pointer}} or an {\em integer\index{integer}}.  Pointers
                    236: have a low-order bit of 0; integers have a low-order bit of 1.
                    237: It is necessary to distinguish pointers from non-pointers in order that
                    238: the garbage collector\index{garbage collector}
                    239: will know what structures to traverse.
                    240: On a byte-addressable machine, all pointers to aligned (4-byte) words
                    241: are multiples of 4 anyway, so pointers have a low-order bit of 0 in
                    242: their natural representation.  The high-order 31 bits of an integer
                    243: field can represent an integer in the range $[-2^{30}, +2^{30}-1]$.
                    244: 
                    245: Some Lisp implementations use a similar representation except that
                    246: pointers have a low-order 1 and integers have a low-order 0. This makes
                    247: arithmetic on tagged integers somewhat easier, and takes advantage
                    248: of the fact that pointers are usually used with a displacement addressing mode;
                    249: the displacement can be adjusted by 1 with no penalty in efficiency.
                    250: This is probably a better arrangement, but 
                    251: either version of this scheme will work; and a simple analysis\cite{appel89:tag}
                    252: shows that the efficiency trade-offs are negligible, and that 
                    253: doing arithmetic around the low-order integer tag is not very costly.
                    254: 
                    255: An {\em object on the heap} may be either a {\em record\index{record}}
                    256: (containing fields,
                    257: i.e. pointers and tagged integers) or
                    258: a {\em string\index{string}}
                    259: (containing bytes of an arbitrary bit-pattern, but no pointers).
                    260: 
                    261: A {\em record} is a sequence of $n>0$ fields numbered $0,1,...,n-1$.
                    262: Each record has a {\em descriptor} at position $-1$.  The low-order bit
                    263: of a descriptor\index{descriptor} is 1
                    264: (making it look like an integer), the next three
                    265: bits identify the object as a record, and the high-order 28 bits give the
                    266: number of fields.  Thus, each record is limited to one gigabyte, which
                    267: should not be a significant limitation (seriously, though, it is important
                    268: to avoid miserly restrictions on the sizes of objects).
                    269: 
                    270: In the implementation of closures, it is useful to be able to point
                    271: at the interior of records.  However, it is always necessary for the
                    272: garbage collector, given a pointer, to find the descriptor of an object.
                    273: If the
                    274: fields $0$ through $k-1$ of a record are all pointers, then a pointer
                    275: can point at the $k^{\rm th}$ field of the record; since the
                    276: descriptor is unboxed (has a low-order bit of 1) and the fields $0$ through
                    277: $k-1$ will all be boxed, the garbage collector can easily search backward
                    278: for the descriptor of the record.  So, we allow pointers to the interior
                    279: of a record, providing all the previous fields are boxed; this is sufficiently
                    280: flexible for our needs in implementing closures, as will be described.
                    281: Otherwise,
                    282: no pointer may address the interior of a record.
                    283: 
                    284: A {\em string\index{string}}
                    285: contains $n>0$ bytes (numbered from $0$ to $n-1$),
                    286: padded with trailing zero bytes
                    287: to a multiple of four.  
                    288: Immediately preceding the $0^{\rm th}$ byte is a one-word
                    289: descriptor whose low-order bit is 1, whose next three bits describe
                    290: the type of string (and distinguish strings from records),
                    291: and whose upper 28 bits give the length in bytes.
                    292: 
                    293: Strings are used for a variety of purposes;
                    294: they can hold printable characters, real numbers, machine code,
                    295: or any other kind of data that doesn't contain pointers.  In fact,
                    296: a string object can hold several embedded string objects.
                    297: A pointer may point at the $0^{\rm th}$ byte of a string, and may point to 
                    298: an embedded string at the
                    299: $k^{\rm th}$ byte provided that $k>0$ is a multiple of 4 and that the bytes
                    300: $k-4, k-3, k-2, k-1$ contain a {\em back-pointer\index{back-pointer}}
                    301: descriptor that
                    302: contains the number $k$ (the offset to the beginning of the record).
                    303: This allows the beginning of a string object
                    304: to be found quickly, given a pointer into it.
                    305: Note that a back-pointer couldn't be distinguished from data, except
                    306: for the fact that it is found just before the pointed-to byte of the
                    307: string.
                    308: 
                    309: The three tag\index{tag} bits of a descriptor\index{descriptor}
                    310: can denote any of these kinds of records,
                    311: strings, etc:
                    312: 
                    313: \begin{itemize}
                    314: \item[0] record\index{record}
                    315: \item[1] forwarding-pointer\index{forwarding pointer}
                    316: (for the garbage collector\index{garbage collector})
                    317: \item[2] back-pointer (that precedes an referenceable location in a string)
                    318: \item[3] embedded string\index{embedded string} length (described below)
                    319: \item[4] array\index{array} (just like a record).
                    320: \item[5] byte-array\index{byte-array}
                    321: (just like a string).
                    322: \item[6] this tag value is unused.
                    323: \item[7] string\index{string}
                    324: \end{itemize}
                    325: 
                    326: Because of ML's static type system, it is not necessary to put type information
                    327: in the descriptors of objects.  Therefore, tags are necessary only
                    328: for the garbage collector's benefit, since it must distinguish objects of
                    329: different formats.
                    330: 
                    331: The only exception to this rule is that mutable objects (arrays, byte-arrays)
                    332: must be distinguishable from immutable objects (records, strings)
                    333: in order to implement the polymorphic
                    334: equality\index{equality}
                    335: feature.  For all purposes of the runtime system, arrays
                    336: are identical to records, and byte-arrays are identical to strings.
                    337: 
                    338: A dynamically-typed language like Lisp would require more than
                    339: three bits to distinguish types of objects; this would not pose a 
                    340: serious problem.  It would still be possible in Lisp to have just
                    341: two formats (pointer-containing and pointer-free), as in our runtime system.
                    342: 
                    343: \section{Forwarding pointers}
                    344: \label{forwarding}
                    345: \index{forwarding}
                    346: The Standard ML of New Jersey runtime system has a generational garbage
                    347: collector\index{garbage collector}
                    348: that takes advantage of object lifetime and referencing
                    349: patterns\cite{appel89:sggc}.
                    350: But at the heart of any generational garbage collector is a simple
                    351: copying garbage collector as originally described by Cheney\cite{cheney70}.
                    352: Objects are copied from {\em fromspace} to {\em tospace} in a breadth-first
                    353: order, with the {\em tospace} itself serving as the queue for the breadth-first
                    354: traversal.  The {\em fromspace} versions of objects are overwritten
                    355: with {\em forwarding pointers}, so that when other references to them are
                    356: found, it is easy to find the {\em tospace} copies of them.
                    357: 
                    358: The fundamental operation in Cheney's algorithm is to {\em forward} a pointer.
                    359: This means taking a pointer into {\em fromspace} and making it point to {\em tospace}.
                    360: If the {\em fromspace} object it points to has already been copied, then its
                    361: forwarding pointer is taken as the new value; otherwise, the object must
                    362: be copied to {\em tospace} and a forwarding pointer installed.
                    363: 
                    364: Forwarding is relatively easy using our runtime data format.  We have a special
                    365: kind of descriptor {\bf forwarding-pointer}, that indicates that a {\em fromspace}
                    366: object has already been copied.  If an object has this descriptor, then
                    367: the first word (after the descriptor) is to be interpreted as the address
                    368: of the copy.  The only complications in forwarding are that pointers
                    369: may point into the middle of records and strings.
                    370: 
                    371: Our runtime system (and garbage collector) is implemented in C.
                    372: In figures~\ref{forward}, \ref{trappc}, \ref{gc} we show the entire
                    373: core of our copying garbage collector.  We could use pseudo-code
                    374: and describe the algorithm abstractly, but we want to emphasize that
                    375: our simple runtime data format does permit an efficient and easy-to-implement
                    376: garbage collector.
                    377: 
                    378: We pretend that all ML values are integers, and make liberal use of
                    379: casts to maintain this pretense.  Our \verb"forward" function is
                    380: shown in figure~\ref{forward}; it
                    381: takes an ML value (by reference) and modifies it (if a pointer into
                    382: {\em fromspace}) to point into {\em tospace}.
                    383: \begin{figure}[htbp]
                    384: \label{forward}
                    385: \begin{verbatim}
                    386:  1    forward(int *refloc)
                    387:  2    {register int *m = *((int**)(refloc)), len;
                    388:  3     if( (m&1)==0 && (m >= (int*)lowest && m < (int*)highest))
                    389:  4     { m--;  /* make m point at the descriptor */
                    390:  5       while(1)
                    391:  6         {len = (*m)>>4;
                    392:  7          switch(m&15)
                    393:  8           {case tag_backptr:
                    394:  9                    m -= len;
                    395: 10                    continue;
                    396: 11            case tag_string:
                    397: 12            case tag_bytearray:
                    398: 13                len = (len+3)/4;
                    399: 
                    400: 19                /* fall through */
                    401: 20            case tag_record:
                    402: 21            case tag_array:
                    403: 22                {int **i=(int**)m, **j=to_ptr;
                    404: 23                     while (j+len >= to_lim)
                    405: 24                       do to_lim=gmore();
                    406: 25                     while (len-- >= 0)          
                    407: 26                       do {*j++ = *i++;}
                    408: 27                     ((int**)m)[1]= 1+(int*)to_ptr;
                    409: 28                     to_ptr = j;
                    410: 29                }
                    411: 30                (*m) = tag_forwarded;
                    412: 31                /* fall through */
                    413: 32            case tag_forwarded:
                    414: 33                *(int*)(refloc) += ((int*)m)[1] - ((int)(m+1));
                    415: 34                return;
                    416: 35            case tag_embedded:
                    417: 36            default:
                    418: 37                    m--; continue;
                    419: 38   }}   }  }
                    420: \end{verbatim}
                    421: \caption{The {\tt forward} function}
                    422: \end{figure}
                    423: Line 2 establishes $m$ as a copy of the original value to be forwarded.
                    424: Line 3 considers only the case that $m$ is a pointer, and points into
                    425: {\em fromspace}.  (Any pointers not into {\em fromspace} are treated as constants.)
                    426: Line 4 adjusts m to point at the descriptor of the {\em fromspace} object.
                    427: 
                    428: Lines 5--38 loop until the beginning of the object is found.
                    429: If [line 8] $m$ points to a back-pointer\index{back-pointer},
                    430: the appropriate offset is subtracted
                    431: and we start again; similarly, if [lines 35,36] $m$ points to a pointer
                    432: (which can happen if we
                    433: point into the middle of a closure\index{closure} record) or an 
                    434: embedded\index{embedded string} descriptor,
                    435: then $m$ is decremented and we try again.
                    436: 
                    437: If $m$ is a string\index{string length} or byte array [line 13],
                    438: then we adjust its $len$ to count in words rather
                    439: than bytes (i.e. we divide by 4, rounding up).
                    440: (Lines 14--18 are explained below.)
                    441: 
                    442: Now at line 22 we have either a string or record in {\em fromspace}
                    443: that needs to be copied.  We verify that there's enough space remaining
                    444: in {\em tospace}; if not, we call the function \verb"gmore" that will either
                    445: \index{gmore}
                    446: get more space or die trying.  Then we copy all the words of the
                    447: object.  Finally [lines 27 and 30] we install a forwarding pointer
                    448: into the {\em fromspace} object and mark its descriptor as forwarded.
                    449: 
                    450: For an already forwarded object (line 33), whether it was forwarded
                    451: in a previous call or just now, adjust the reference \verb|*refloc|
                    452: to point at the new object.  Since the reference might have pointed
                    453: at the middle of the old object, we must take care to make it point
                    454: to the corresponding location in the new object; this is accomplished
                    455: by the computation:
                    456: \[
                    457: {\rm old\;pointer} + {\rm beginning\;of\;new\;object} - 
                    458: {\rm beginning\;of\;old\;object.}
                    459: \]
                    460: 
                    461: Then the \verb"forward" function returns.
                    462: 
                    463: Almost without exception, all pointers to objects point to places where
                    464: an immediately preceding descriptor\index{descriptor}
                    465: explains the format of the object.
                    466: The exception is an artifact of the 
                    467: very fast allocation mechanism that can create and initialize
                    468: a $k$-word object in $k+2$ instructions\cite{appel89:sggc}.
                    469: This mechanism relies on a page fault\index{fault}
                    470: trap to invoke the garbage collector
                    471: when the free space is exhausted; when this trap occurs, the program
                    472: \index{pc}
                    473: counter will point into the middle of a string object that might
                    474: be moved by the garbage collector.  This is the only reference that
                    475: points to a place in a string that lacks a backpointer\index{back-pointer}.
                    476: It can be handled
                    477: relatively simply: as each string object is moved, we check to see
                    478: if the saved program counter points into the middle of it; if so,
                    479: we adjust the saved program counter.  Since the number of string objects
                    480: is typically less than one percent of the number of record objects,
                    481: very costly overall.  The test occurs at line 14, as shown in
                    482: figure~\ref{trappc}.
                    483: \index{trap_pc}
                    484: \begin{figure}[htbp]
                    485: \label{trappc}
                    486: \begin{verbatim}
                    487: 
                    488: 14                if (!trap_pc_done 
                    489: 15                     && m < trap_pc && m+len >= trap_pc)
                    490: 16                    {trap_pc_done=1;
                    491: 17                     trap_pc += to_ptr - (int**)m;
                    492: 18                    }
                    493: \end{verbatim}
                    494: \caption{Testing the (saved) program counter}
                    495: \end{figure}
                    496: 
                    497: \section{Garbage collection}
                    498: \label{garbage}
                    499: 
                    500: Once the \verb"forward"ing procedure is written, a simple copying garbage
                    501: collector is trivial (figure~\ref{gc}).
                    502: \index{garbage collector}
                    503: \begin{figure}[htbp]
                    504: \label{gc}
                    505: \begin{verbatim}
                    506:  1    gc(int *from_low, int *from_high,
                    507:  2       int *to_low, int *to_high,
                    508:  3       int **roots,
                    509:  4       int *trap_pcx, int *(*get_more)()
                    510:  5      )
                    511:  6    { gmore=get_more; trap_pc = *trap_pcx; to_ptr = to_low;
                    512:  7      trap_pc_done = !(trap_pc>=(int*)from_low 
                    513:  8                      && trap_pc<(int*)from_high);
                    514:  9      lowest=from_low; highest=from_high;
                    515: 
                    516: 10      while (*roots) forward(*roots++);
                    517: 
                    518: 11      { int *x = to_low;
                    519: 12      while (x<to_ptr)
                    520: 13      { int *p = x+1, descr = (*x), len=descr>>4;
                    521: 14        if (string_or_bytearray(descr)) 
                    522: 15              x += (len+3)/4 + 1;
                    523: 16        else {x += len+1;
                    524: 17              do {forward(p++);} while (p<x);
                    525: 18      }      }
                    526: 
                    527: 19      *trap_pcx = trap_pc;
                    528: 20    }
                    529: \end{verbatim}
                    530: \caption{The {\tt gc} function}
                    531: \end{figure}
                    532: The function \verb|gc| is parametrized by the lower and upper bounds
                    533: of {\em fromspace} and the lower and upper bounds of {\em tospace}.  The \verb"roots"
                    534: parameter is a vector of pointers to the roots of accessible objects;
                    535: in the Standard ML system this is little more than the addresses of
                    536: saved machine registers.  The last two parameters are the address
                    537: of the saved program counter (necessary as explained in 
                    538: section~\ref{forwarding})
                    539: and the address of a function which can be called to expand the {\em tospace}
                    540: if necessary.
                    541: 
                    542: The first step [lines 6--9] is to put various quantities into global
                    543: variables to make them accessible to the \verb"forward" procedure,
                    544: since C does not have nested procedures.
                    545: 
                    546: The next step is to forward all the root variables (line 10).
                    547: 
                    548: Finally [lines 11-18],
                    549: we forward each word of each record in {\em tospace}.  Since the
                    550: \verb|forward| procedure may increment the \verb"to_ptr" variable
                    551: that denotes the end of the filled portion of {\em tospace}, we have to
                    552: keep iterating until \verb|x| catches up with it.  In effect,
                    553: the {\em tospace} between \verb|x| and \verb|to_ptr| is the queue of the
                    554: breadth-first search.  The queue must eventually become empty,
                    555: since there is a finite amount of accessible data to be copied.
                    556: In this phase, integers (and strings) are just skipped, since
                    557: they contain no pointers that need forwarding.
                    558: 
                    559: This garbage collector is relatively simple, and therefore it's not difficult
                    560: to make it fast.  Even the fanciest of generational or concurrent collectors
                    561: relies on a program like this as its inner loop; the very simple layout
                    562: of data in the Standard ML runtime system makes efficiency easy.
                    563: 
                    564: \section{Fast allocation}
                    565: \label{fast}
                    566: 
                    567: Copying garbage collection, because it takes time proportional only to the
                    568: live data and not to the amount of garbage, can be arbitrarily 
                    569: efficient\cite{appel87:gc}.  That is, there is no lower bound on the
                    570: amortized cost of garbage collection for each cell allocated.
                    571: Standard ML's generational copying garbage collector\cite{appel89:sggc}
                    572: expends on the order of one (amortized)
                    573: machine instruction for every cell allocated; the precise amortized
                    574: cost is proportional to the ratio of live data to heap size.
                    575: 
                    576: Since garbage collection is so fast, it makes sense to make
                    577: allocation\index{allocation}
                    578: fast too.  Standard ML of New Jersey allocates a new record every
                    579: 40 to 80 machine instructions, so we want a very low overhead on the
                    580: creation of an object.
                    581: 
                    582: Since copying garbage collectors compact the live objects into consecutive
                    583: memory cells, the free region is all contiguous.  This means that to create
                    584: an $n$-word object, we can just grab the next $n$ words of the free space.
                    585: Since objects are created so often, it makes sense to make allocation
                    586: in-line (with no procedure call) and to reserve a register \verb"fr" to point
                    587: to the beginning of the free region.  Then it becomes very simple
                    588: to allocate a new object:  \verb"A = CONS(B,C)" is implement as
                    589: \begin{enumerate}
                    590: \item mem[fr+2] := C
                    591: \item mem[fr+1] := B
                    592: \item mem[fr]   := descriptor(2,tag\_record)
                    593: \item A := fr+1
                    594: \item fr := fr+3
                    595: \end{enumerate}
                    596: Each of these lines is one machine instruction.  Thus, in five instructions
                    597: (plus one instruction of amortized garbage collection overhead)
                    598: we have made a new {\em cons} cell; it takes only twice as long to create
                    599: a data structure as it does to read it!  
                    600: This fast allocation encourages a simple and clean programming style;
                    601: no longer do programmers have to stand on their heads to avoid {\em cons}ing.
                    602: 
                    603: Of course, this won't work if the free region is exhausted.  We can insert
                    604: an explicit test to make sure that \verb"fr" is not near the end of the
                    605: free region.  But a more clever trick is to make the virtual memory page
                    606: at the end of the free region inaccessible, so that we will get a page
                    607: fault\index{fault}\index{page fault}
                    608: when the free region is exhausted.  That's why we store the
                    609: last field (\verb"mem[fr+2]") first in the example above; it's simpler
                    610: for the garbage collector if the fault occurs at the very beginning
                    611: of creating the new cell.
                    612: (On the MC68020, the state of the machine at a page fault is complicated,
                    613: and it's not easy to restart the faulting instruction; so on that machine
                    614: we use an explicit comparison with a limit register.)
                    615: 
                    616: The page fault will be caught by the hardware and handed to the operating
                    617: system, which can then pass control to the user process.  The user process
                    618: then has to find all the registers of the faulting procedure; these
                    619: registers are the roots of the accessible data.  Appel\cite{appel89:sggc}
                    620: and Cormack\cite{cormack88} both describe schemes for finding these
                    621: registers; Cormack's is simpler and more reliable.
                    622: 
                    623: The ML program and the garbage collector behave like two processes
                    624: (threads\index{thread})
                    625: running in the same address space: while one thread executes, the
                    626: other's registers are saved.  When ML suspends itself (either
                    627: because of a page fault indicating end of free space, or voluntarily),
                    628: it saves its registers into a static area that looks just like a process
                    629: control block, so that the garbage collector (and other parts of the
                    630: runtime system) can access them.
                    631: 
                    632: \begin{quotation}
                    633: {\small Here's how Cormack's scheme works:  A page fault arrives, causing
                    634: the operating system to invoke the C function \verb"ghandle".
                    635: \index{ghandle}
                    636: This function is passed (as an argument) a structure containing
                    637: the address of the faulting instruction.  Other saved registers are
                    638: at undocumented locations on the stack.  \verb"ghandle" saves the
                    639: \index{saved_pc}\index{pc}
                    640: faulting pc in \verb"saved_pc", and modifies its argument structure
                    641: to point to the assembly-language function \verb"saveregs".  Then it
                    642: returns; the operating system restores the pc from the argument
                    643: structure, restores other registers (from those undocumented locations),
                    644: and resumes.  But of course, we have fooled the operating system into
                    645: \index{saveregs}
                    646: resuming in \verb"saveregs", which stores all registers into known
                    647: global variables and returns to the C thread (i.e. in the function
                    648: \verb"runML").
                    649: 
                    650: The C thread (typically) does garbage collection, then calls the
                    651: assembly-language function \verb"restoreregs", which loads registers
                    652: (and program counter) from their global variables and resumes execution
                    653: of the ML thread.
                    654: 
                    655: Sometimes it's useful to invoke the C thread without a page fault,
                    656: e.g. to export the process state into a file or to do a structured
                    657: write.  In this case, \verb"saveregs" can be called directly from assembly
                    658: code in the ML thread.
                    659: }
                    660: \end{quotation}
                    661: \section{Heap allocation of procedure call frames}
                    662: \label{heap}
                    663: 
                    664: Since heap allocation is so cheap, we put procedure call frames on the
                    665: heap instead of on the stack\index{stack}.
                    666: This has many advantages\cite{appel89:cps},
                    667: but the one of interest here is that it greatly simplifies the runtime
                    668: system.  Many of the facilities described in this paper
                    669: would be much more complicated to implement if runtime stacks had to be
                    670: dealt with.
                    671: 
                    672: Any record on the stack must be initialized as it is created;
                    673: otherwise it will contain garbage data that could be interpreted
                    674: by the garbage collector as spurious pointers.  However, typical
                    675: code generators will allocate a call frame on
                    676: entry to a procedure and spill registers into it as needed.
                    677: This must be avoided.  One way to solve
                    678: this problem is to delay allocation of the frame; values can be accumulated
                    679: in registers until spilling is necessary, then all the registers can 
                    680: be spilled at once into a newly-created frame, which is just a record
                    681: object in runtime data format.  Thus, the frame is completely initialized
                    682: as it is created.
                    683: 
                    684: The Standard ML of New Jersey compiler doesn't use ``procedure call frames.''
                    685: Since it uses continuation\index{continuation}-passing 
                    686: style\cite{steele78}\cite{kranz86}\cite{appel89:cps},
                    687: what an ordinary mortal might call a ``frame'' is really just the
                    688: closure of a continuation.  It is easy to create closures as ordinary
                    689: records that are completely initialized when they are created.
                    690: This simplifies both the code generator
                    691: and the runtime system, though the trick described in the previous paragraph
                    692: will work for more conventional code generators.
                    693: 
                    694: Since there's no runtime stack, the 
                    695: {\em call-with-current-continuation}\cite{rees86}
                    696: \index{call-with-current-continuation}
                    697: primitive can be implemented very efficiently.  In implementations
                    698: with a runtime stack, the entire stack must be copied when {\em call/cc}
                    699: is evaluated (or else there must be a lot of extra complexity in stack
                    700: management); without a stack, the execution of
                    701: {\em call/cc}, and the execution of
                    702: saved continuations, take just a few instructions each.  This makes
                    703: {\em call/cc} a practical programming tool, just as fast allocation
                    704: makes {\em cons} practical.
                    705: 
                    706: \section{Representing ML structures in records and strings}
                    707: \label{representing}
                    708: Section~\ref{data} describes just two kinds of 
                    709: objects---records\index{record} and
                    710: strings\index{string}---referenceable at their beginning and (in a limited way)
                    711: at interior points.  All of the kinds of ML data can be represented
                    712: in records and strings.
                    713: 
                    714: An ML value must be representable in one word.  A larger value can be {\em
                    715: boxed} by putting it in several words on the heap and keeping a (one-word)
                    716: pointer to it.  A small value (like a 31-bit integer)
                    717: can be kept {\em unboxed\index{unboxed values}} by representing it without indirection in
                    718: a machine word.
                    719: Boxed\index{boxed values}
                    720: values (pointers)
                    721: are distinguished from unboxed values (integers, or data represented as
                    722: integers)
                    723: by their low-order bit.
                    724: 
                    725: An ML {\bf record\index{record}}
                    726: is an $n$-tuple of values.  It has a natural representation
                    727: as a record in our runtime data format.  ML has a {\it pro forma}
                    728: record of length 0; as our runtime data format does not allow objects of length
                    729: 0 (since that would leave no room to put a forwarding pointer, as described in
                    730: a section~\ref{forwarding}),
                    731: the empty record
                    732: is represented as the unboxed integer 0.  This does no harm, since
                    733: no fields can be selected from an empty record anyway.
                    734: 
                    735: An ML {\bf array\index{array}} is also an $n$-tuple of values.
                    736: In the ML language, 
                    737: record-field offsets are determined at compile-time, whereas
                    738: arrays may be indexed by runtime values;
                    739: and arrays may be modified after
                    740: they are created, whereas records are immutable.
                    741: But neither of these
                    742: differences matters to the runtime system.
                    743: The polymorphic equality function of ML requires distinguishing between
                    744: mutable and immutable objects at runtime, so arrays and records
                    745: have different tags, however; this requires that they have different
                    746: tags.
                    747: 
                    748: ML does permit 
                    749: arrays\index{arrays of length 0} of length 0,
                    750: but our runtime data format does not.
                    751: Happily, all arrays of length 0 have
                    752: the same behaviour, so a special object of length 0 (located outside
                    753: the garbage-collectible region) serves to represent all the
                    754: empty arrays.
                    755: 
                    756: {\bf References} in ML are mutable objects:  
                    757: \verb"val a = ref 5" declares a reference\index{reference variables}
                    758: variable \verb"a" that may be
                    759: changed by a later assignment statement, unlike most variables which cannot
                    760: be modified once defined.  Reference variables behave just like single-element
                    761: arrays, and that's how they are represented in the runtime system.
                    762: 
                    763: ML has {\bf datatype\index{datatype}}
                    764: declarations to allow tagged variant types.  For example,
                    765: the declaration
                    766: \begin{verbatim}
                    767: datatype t = NAME of string | NUMBER of int
                    768: \end{verbatim}
                    769: specifies a type \verb"t" that can be either a string or an integer,
                    770: depending on whether the constructor\index{constructor} \verb"NAME"
                    771: or \verb"NUMBER" is used to create it.
                    772: The program can examine any object of type \verb"t" and determine whether
                    773: it has the \verb"NAME" or \verb"NUMBER" representation---that is, it
                    774: is a {\em tagged} union.  An object of type \verb"t" is represented
                    775: as a two-element record: one field contains the tag (a small integer),
                    776: and the other field contains the value.
                    777: The details of constructor representation may be of interest
                    778: only to those knowledgeable in ML, and are described in 
                    779: section~\ref{constructor}.
                    780: 
                    781: ML has {\bf strings\index{string}}
                    782: of characters.  Strings can be of any non-negative length.
                    783: Since our runtime representation cannot support objects of length 0
                    784: (because one word is needed to store the forwarding pointer, as described
                    785: in a section~\ref{forwarding}),
                    786: the empty string\index{empty string}\index{null string}
                    787: \index{string of length 0}
                    788: must be treated specially.  However, it turns out that
                    789: it is never necessary to create a new empty string; the only occurrences
                    790: of empty strings are as literals in the program text.  String
                    791: literals (and back-pointers) will be discussed in section~\ref{function}.
                    792: 
                    793: Strings\index{Strings of length one} of length one are treated specially by the compiler,
                    794: though this is not necessary either for the ML language or for our runtime
                    795: data format.  Single-character strings\index{single-character strings}
                    796: \index{characters}
                    797: are treated as unboxed integers
                    798: between 0 and 255, to avoid heap-allocation for this (frequent) special case.
                    799: Though this technique may cause
                    800: less allocation, it requires special tests on every
                    801: string operation.  It's not clear whether this special case saves more than
                    802: it costs.  However, this special case is transparent to the runtime system
                    803: anyway; all conversions, etc. are handled explicitly by the compiler;
                    804: the single-character strings are not a new runtime data format, but
                    805: look like ordinary unboxed integers.
                    806: 
                    807: {\bf Byte-arrays\index{byte-arrays}}
                    808: in ML are to strings as arrays are to records:  they have the
                    809: same representation as strings, but their tag is different to facilitate
                    810: certain language features.
                    811: 
                    812: {\bf Floating-point\index{floating-point}} numbers are too large to fit in 
                    813: one word.  Even if we chose
                    814: to use single-precision floating point, it would be difficult to store
                    815: them unboxed\index{unboxed}
                    816: because there is no bit available for use as a tag\index{tag} bit.
                    817: Thus, all floating-point numbers are stored boxed, with 8 bytes of data and
                    818: one word of descriptor.
                    819: The descriptor\index{descriptor}
                    820: must be one that indicates that the contents of the
                    821: object contain no pointers.
                    822: Since ML is statically typed, the language
                    823: never needs to distinguish (at runtime) floating-point numbers from
                    824: strings, so we can just represent a double-precision float as an 8-byte
                    825: string.  This is not an ASCII string, it is the hardware representation;
                    826: a ``string'' descriptor does not denote that the contents are printable
                    827: characters, just that there is an arbitrary bit-pattern not containing
                    828: pointers.
                    829: 
                    830: Similarly, {\bf machine-code procedures}
                    831: \index{machine code} do not require a separate class
                    832: of descriptor.  From the garbage collector's point of view, machine
                    833: code is just like strings: it contains bits that are to be (perhaps)
                    834: moved from place to place but which never represent pointers.
                    835: Thus, machine code is just kept in string objects.  The mechanism by
                    836: which we avoid placing pointers\index{pointers}
                    837: (or any reference to other objects)
                    838: in machine code will be described in section~\ref{modules}.
                    839: 
                    840: \section{Closures}
                    841: \label{closures}
                    842: \index{closure}
                    843: In ML, as in many lexically scoped languages, functions may have free
                    844: variables.  That is, the body of a function may reference not only its
                    845: own formal parameters\index{formal parameters}
                    846: (the bound variables of the function), but the
                    847: formal parameters of a statically enclosing function.  Consider the
                    848: function 
                    849: \begin{verbatim}
                    850:     fun add(x) = let fun add1(y) = x+y
                    851:                   in add1
                    852:                  end
                    853: \end{verbatim}
                    854: The function \verb"add(x)" defines an internal function \verb"add1(y)" that
                    855: adds \verb"x" and \verb"y", and then \verb"add" returns \verb"add1" as its
                    856: result.  The variable \verb"x" is bound by \verb"add", and
                    857: \verb"y" is bound by \verb"add1".  Therefore, if we just consider the
                    858: function definition \verb"fun add1(y) = x+y", the variable \verb"x"
                    859: is a free variable\index{free variable} of this internal function.
                    860: 
                    861: In order to evaluate \verb"add1" applied to some argument, we must have
                    862: a context in which the value \verb"x" is defined.  It will not suffice
                    863: for the \verb"add" function just to return a pointer to the machine code
                    864: of \verb"add1" as its result.  Instead, it must return a {\em closure}:
                    865: a combination of a code-pointer an some information about the values of
                    866: all the free variables (in this case, just \verb"x").
                    867: 
                    868: For example, the result of evaluating \verb"add(5)" is (abstractly) the
                    869: function which will always add 5 to its argument.  
                    870: Figure~\ref{add5} shows a possible representation
                    871: as a record object, with a descriptor at  the beginning for the
                    872: garbage collector.
                    873: \begin{figure}[htbp]
                    874: \label{add5}
                    875: \begin{verbatim}
                    876:      +------+
                    877:      | desc |
                    878:      +------+
                    879: p--> |  o------------> machine code for add1
                    880:      +------+
                    881:      |  5   |
                    882:      +------+
                    883: \end{verbatim}
                    884: \caption{A simple closure.}
                    885: \end{figure}
                    886: The pointer \verb"p" is the runtime representation of the particular version
                    887: of \verb"add1" in which \verb|x| is bound to 5.  This pointer may be passed
                    888: as an argument to some other function that doesn't know anything about
                    889: the \verb"add1" function.  All the other function has to know is how
                    890: to ``call'' a closure.  The calling sequence might be:
                    891: \begin{enumerate}
                    892: \item Put \verb|p| in register 0.
                    893: \item Put the return address in register 1.
                    894: \item Fetch p[0] into register 2.
                    895: \item Jump to the address contained in register 2.
                    896: \end{enumerate}
                    897: Step 1 is necessary so that when \verb"add1" is entered, it can gain
                    898: access to the closure record where the binding of \verb"x" is stored.
                    899: Step 2 is just the normal saving of a return address, which might
                    900: in many cases be on a stack instead of in a register.  Steps 3 and
                    901: 4 are just the typical code necessary to jump to a runtime-bound address.
                    902: 
                    903: It is important to note that the function that calls a closure \verb"p"
                    904: doesn't need to access any part of it except the machine-code pointer
                    905: in \verb|p[0]|.  This means that the free variables (the rest of the
                    906: closure) may have an arbitrary arrangement, known only to the
                    907: function that builds the closure (\verb"add", in this case) and the
                    908: function that executes it (\verb"add1", in this case).  Closures can
                    909: be structured in many ways\cite{appel88:clo}, and can even point
                    910: to other closures for access to free variables (i.e., they can
                    911: have ``static links\index{static link},'' in Algol terminology).
                    912: 
                    913: Clearly, closures are represented a lot like records.  A closure has several
                    914: fields, each of which is either a machine-code pointer or a free variable.
                    915: The machine-code pointer is just a (boxed) string pointer,
                    916: and the free variables are
                    917: just one-word ML objects.  Therefore, we use record\index{record}
                    918: objects to represent closures.
                    919: 
                    920: When several functions share a set of free variables, it is convenient
                    921: to have a {\bf multiple-function closure.}
                    922: \index{closure for multiple functions}
                    923: \index{multiple functions closure}
                    924: \index{shared closure}
                    925: Consider the functions
                    926: \begin{verbatim}
                    927: fun test(y,z) =
                    928:     let fun even(i) = if i=0 then y else odd(i-1)
                    929:         and odd(i)  = if i=0 then z else even(i-1)
                    930:      in (even,odd)
                    931:     end
                    932: \end{verbatim}
                    933: This function, when applied to arguments \verb"(3,7)", would return
                    934: two functions (let's call them \verb"even37" and \verb"odd37").
                    935: from integer to integer.  \verb"Even37" applied to even integers
                    936: returns 3 and applied to odd integers returns 7, and \verb"odd37"
                    937: works the other way around.
                    938: 
                    939: We could represent these functions by two different closures, as shown
                    940: in figure~\ref{twoclos}.
                    941: \begin{figure}[htbp]
                    942: \label{twoclos}
                    943: \begin{verbatim}
                    944: 
                    945:            +------+
                    946:            | desc |
                    947:            +------+
                    948: even37 --> |  o------------> machine code for even
                    949:            +------+
                    950:          y |  3   |
                    951:            +------+
                    952:        odd |  o-------> odd37
                    953:            +------+
                    954: 
                    955:            +------+
                    956:            | desc |
                    957:            +------+
                    958:  odd37 --> |  o------------> machine code for odd
                    959:            +------+
                    960:          z |  7   |
                    961:            +------+
                    962:       even |  o--------> even37
                    963:            +------+
                    964: \end{verbatim}
                    965: \caption{Two mutually recursive closures}
                    966: \end{figure}
                    967: This is particularly unfortunate because \verb"even" and \verb"odd"
                    968: are free variables\index{free variables}
                    969: of each other, and this requires the construction
                    970: of a cyclic data structure of closures, which can get complicated.
                    971: 
                    972: A more clever trick\cite{appel87:sml}\cite{kranz86} is to let these
                    973: two functions share a closure, as shown in figure~\ref{oneclos}.
                    974: \begin{figure}[htbp]
                    975: \label{oneclos}
                    976: \begin{verbatim}
                    977:            +------+
                    978:            | desc |
                    979:            +------+
                    980: even37 --> |  o------------> machine code for even
                    981:            +------+
                    982:  odd37 --> |  o------------> machine code for odd
                    983:            +------+
                    984:          y |  3   |
                    985:            +------+
                    986:          z |  7   |
                    987:            +------+
                    988: \end{verbatim}
                    989: \caption{Two functions sharing a closure}
                    990: \end{figure}
                    991: Now, when \verb"even37" is called, its closure-pointer will be loaded
                    992: in register 0, and it knows that it can access \verb:y: at offset 2 from
                    993: register 0.  And when \verb"odd37" is called, its closure pointer
                    994: (which is really a pointer to the second field of the record) will be
                    995: loaded in register 0, and it can access \verb"z" at offset 2.
                    996: In some situations, the same free variable may be accessed by different
                    997: functions in the closure, and the code generator will have to remember
                    998: that the offsets\index{offset} from register 0 to a particular field
                    999: depend on which closure-pointer is loaded (i.e. which function we are
                   1000: generating code for).
                   1001: 
                   1002: Finally, if \verb"even" wants to call \verb"odd", it can just generate
                   1003: the appropriate closure value in register 0 by adding an offset of 1
                   1004: to its own closure pointer.  In general, all mutually recursive
                   1005: functions can be handled this way, and cyclic\index{cyclic closure}
                   1006: closure structures
                   1007: are never required.  But even non-mutually-recursive sets of functions
                   1008: can save storage by sharing closures.
                   1009: 
                   1010: It should now be clear why (in section~\ref{data}) we arranged for
                   1011: pointers into the middle of record objects.  Given the pointer
                   1012: \verb"odd37", the garbage collector can easily search backward until
                   1013: it finds the descriptor\index{descriptor}
                   1014: of the record; the descriptor is unboxed,
                   1015: while all of the previous machine-code pointers are boxed fields.
                   1016: 
                   1017: \section{Function entry points}
                   1018: \label{function}
                   1019: 
                   1020: A typical compilation\index{compilation unit} unit
                   1021: will contain many functions, and it would be
                   1022: unwieldy to make a different string object for each function:
                   1023: since the functions in a compilation unit often call each other,
                   1024: these calls will go much faster as pc-relative jumps than if
                   1025: we had to fetch addresses from closures.  But to achieve pc-relative
                   1026: jumps, the relative distance between two functions cannot change;
                   1027: and the garbage collector moves objects around.
                   1028: So to achieve
                   1029: this constant distance, we must put several functions in the
                   1030: same string object, and have several
                   1031: entry points.  
                   1032: 
                   1033: For example, the \verb"even", \verb"odd", and \verb"test"
                   1034: functions of section~\ref{closures} could all be placed a single string,
                   1035: \index{embedded string}
                   1036: with appropriate back-pointer descriptors (figure~\ref{embedded1}).
                   1037: \begin{figure}[htbp]
                   1038: \label{embedded1}
                   1039: \begin{verbatim}
                   1040:            +------+
                   1041:            | desc |
                   1042:            +------+
                   1043:   test --> |  t   |
                   1044:            +  e   +
                   1045:            |  s   |
                   1046:            +  t   +
                   1047:            |      |
                   1048:            +------+
                   1049:            | [5]  |
                   1050:            +------+
                   1051:   even --> |  e   |
                   1052:            +  v   +
                   1053:            |  e   |
                   1054:            +  n   +
                   1055:            |      |
                   1056:            +------+
                   1057:            | [10] |
                   1058:            +------+
                   1059:    odd --> |  o   |
                   1060:            +  d   +
                   1061:            |  d   |
                   1062:            +------+
                   1063: \end{verbatim}
                   1064: \caption{Three code strings embedded in a string}
                   1065: \end{figure}
                   1066: The numbers \verb"[5]" and \verb"[10]" are back-pointers, formatted
                   1067: as described in section~\ref{data}.  They tell the garbage collector
                   1068: the offset to the beginning of the object, so it can find the true descriptor.
                   1069: 
                   1070: The advantage to putting several functions in the same string is that
                   1071: they can refer to each other directly, without having to access each
                   1072: other through closures.  Of course, it will be necessary to adjust
                   1073: the closure\index{closure}
                   1074: pointer in register 0.  Here, \verb"even" could call \verb"odd"
                   1075: by this (simpler) calling sequence:
                   1076: \begin{enumerate}
                   1077: \item Add 1 to register 0  (to adjust p).
                   1078: \item Put the return address in register 1.
                   1079: \item Jump to \verb"odd".
                   1080: \end{enumerate}
                   1081: 
                   1082: The garbage collector may move a code string from one place to another
                   1083: in memory, so it is necessary that the jump in step 3 be a 
                   1084: pc\index{pc}-relative
                   1085: jump.
                   1086: 
                   1087: ML programs may contain literal\index{literal}
                   1088: \index{string literal}
                   1089: strings and floating\index{floating point constant} point constants.
                   1090: These are embedded within string objects amongst the machine-code
                   1091: functions.  For example,
                   1092: \begin{verbatim}
                   1093: fun show(b) = if b then "true" else "false"
                   1094: \end{verbatim}
                   1095: could almost have the representation shown in figure~\ref{embedded2}.
                   1096: \begin{figure}[htbp]
                   1097: \label{embedded2}.
                   1098: \begin{verbatim}
                   1099:            +------+
                   1100:            | desc |
                   1101:            +------+
                   1102:   show --> |  s   |
                   1103:            +  h   +
                   1104:            |  o   |
                   1105:            +  w   +
                   1106:            |      |
                   1107:            +------+
                   1108:            | [5]  |
                   1109:            +------+
                   1110:   true --> | true |
                   1111:            +------+
                   1112:            | [7]  |
                   1113:            +------+
                   1114:  false --> | fals |
                   1115:            +      +
                   1116:            | e000 |
                   1117:            +------+
                   1118: \end{verbatim}
                   1119: \caption{String literals, oversimplified}
                   1120: \end{figure}
                   1121: The function \verb"show" returns a pointer to \verb"true" or \verb"false";
                   1122: this pointer points into the middle of the large string object at one
                   1123: of the embedded strings.  Because the embedded string is preceded by a
                   1124: back-pointer descriptor, the garbage collector won't be confused.
                   1125: 
                   1126: There's a slight problem with this.  Although the garbage collector doesn't
                   1127: need to know the length of the string \verb|"true"|, the ML program might
                   1128: apply the \verb"length" function to it.  
                   1129: String\index{string length} lengths are kept in the
                   1130: descriptor word, and are accessed by fetching field $-1$ and shifting
                   1131: off the tag bits.  The problem here is that the descriptor for \verb"true"
                   1132: is a back-pointer\index{back-pointer},
                   1133: not a string descriptor.  The solution---simple and
                   1134: inelegant---is to introduce a new kind of descriptor, an {\em embedded
                   1135: string}, which contains the length of the string and always is preceded
                   1136: by a backpointer.  Thus, the actual representation of the \verb"show"
                   1137: function is more like figure~\ref{embedded3}.
                   1138: \begin{figure}[htbp]
                   1139: \label{embedded3}
                   1140: \begin{verbatim}
                   1141:            +------+
                   1142:            | desc |
                   1143:            +------+
                   1144:   show --> |  s   |
                   1145:            +  h   +
                   1146:            |  o   |
                   1147:            +  w   +
                   1148:            |      |
                   1149:            +------+
                   1150:            | [6]  |
                   1151:            +------+
                   1152:            | 4emb |
                   1153:            +------+
                   1154:   true --> | true |
                   1155:            +------+
                   1156:            | [9]  |
                   1157:            +------+
                   1158:            | 5emb |
                   1159:            +------+
                   1160:  false --> | fals |
                   1161:            +      +
                   1162:            | e000 |
                   1163:            +------+
                   1164: \end{verbatim}
                   1165: \caption{String literals embedded in a code string}
                   1166: \end{figure}
                   1167: 
                   1168: The treatment of embedded \index{floating point constant}
                   1169: floating point constants is just like that of
                   1170: embedded strings.  However, the ML program will never need to take the
                   1171: length of a floating point constant, so the embedded-string descriptor
                   1172: can be omitted and a simple backpointer\index{backpointer} will suffice.
                   1173: Note that embedded strings of length 0 are permitted, since there
                   1174: is no need for a word to install a forwarding pointer (see 
                   1175: section~\ref{forwarding},
                   1176: and note that forwarding pointers go only at the beginning of the
                   1177: entire string object, not at embedded strings).
                   1178: 
                   1179: The address of string and floating constants may appear in machine code,
                   1180: but only in a pc\index{pc}-relative way, since code objects may be moved in memory
                   1181: by the garbage collector\index{garbage collector}.
                   1182: 
                   1183: \section{Modules and compilation units}
                   1184: \label{modules}
                   1185: 
                   1186: Standard ML has a powerful module\index{module}
                   1187: system, allowing nested modules,
                   1188: modules as parameters to other modules, thinning of a richer module
                   1189: to form a smaller module, etc.  The module system can be represented
                   1190: entirely as records and functions in the runtime system\cite{appel87:sml};
                   1191: no new machinery is needed.
                   1192: 
                   1193: A compilation unit\index{compilation unit}
                   1194: is just a sequence of function and value declarations that
                   1195: will be compiled together into one string object.  But one compilation
                   1196: unit may refer to values defined in another, and a linkage\index{linkage}
                   1197: mechanism is required.
                   1198: 
                   1199: We can use the power of higher-order functions to implement the linkage
                   1200: in a way that is transparent to the runtime system.  If unit \verb"B"
                   1201: refers to a value in unit \verb"A", then the compiler will treat
                   1202: \verb"A" as an implicit parameter of \verb"B".  For example, suppose we have
                   1203: two compilation units
                   1204: \begin{verbatim}
                   1205: fun f(x,y) = x+y
                   1206: \end{verbatim}
                   1207: and
                   1208: \begin{verbatim}
                   1209: fun g(z) = f(z,z)
                   1210: \end{verbatim}
                   1211: clearly \verb"f" is a free variable\index{free variable}
                   1212: of \verb"g".  But the compiler can
                   1213: parametrize the second unit, as if it were written
                   1214: \begin{verbatim}
                   1215: fun g0(f) = let fun g(z) = f(z,z)
                   1216:             in g
                   1217:            end
                   1218: \end{verbatim}
                   1219: This compilation unit has no free variables; the ML system will simply
                   1220: apply the second unit to the first one, yielding the desired function
                   1221: \verb"g".  The compiler must keep track of inter-module references
                   1222: in order to do this, but at least the runtime mechanism for linkage is simple.
                   1223: 
                   1224: So, each compilation unit is a closed function with no free variables.
                   1225: No special linkage mechanism is built into the runtime system; the
                   1226: closure\index{closure}
                   1227: mechanism handles linkage very elegantly.
                   1228: 
                   1229: Eliminating free variables from compilation units also simplifies the
                   1230: code generator\index{code generator}.
                   1231: Whenever the code generator must analyze local free variables,
                   1232: generate pc\index{pc}-relative references, etc.,
                   1233: its job is much simpler because there
                   1234: are no global references to other objects.  The interface between the
                   1235: front end of the compiler and the code generator are much cleaner as a result.
                   1236: 
                   1237: \section{Linkage to assembly language}
                   1238: \label{linkage}
                   1239: 
                   1240: Some primitives of a programming language are best implemented in a different
                   1241: language (typically assembly language\index{assembly language}).
                   1242: Standard ML of New Jersey
                   1243: makes use of 8 functions written in assembly language, and 8 functions written
                   1244: in C\index{C}.
                   1245: (For comparison, more than 100 standard-library\index{standard library}
                   1246: functions are written
                   1247: in ML.)  The assembly-language functions are:
                   1248: \begin{enumerate}
                   1249: \item {\bf array\index{array}(n,x)} creates an array of length $n$, each element initialized
                   1250: to $x$.
                   1251: \item {\bf callc\index{callc}(f,a)} calls a C-language function $f$ with argument $a$.
                   1252: \item {\bf create\_b(n)} creates\index{create\_b} a byte-array of length $n$.
                   1253: \item {\bf create\_s(n)} creates\index{create\_s} a string of length $n$.
                   1254: \item {\bf floor\index{floor}(x)} returns the smallest integer less than
                   1255: or equal to $x$.
                   1256: \item {\bf logb\index{logb}(x)} returns the exponent part of (the floating-point) $x$.
                   1257: \item {\bf scalb\index{scalb}(x)} inserts a new exponent into (floating-point) $x$.
                   1258: \item {\bf syscall\index{syscall}(i,args,k)} does operating system
                   1259: \index{operating system} kernel-call $i$ with
                   1260: $k$ arguments.
                   1261: \end{enumerate}
                   1262: The assembly-language functions are located at (constant) addresses within
                   1263: the runtime system.  But it's a good idea to follow the rules about
                   1264: free variables\index{free variables}
                   1265: in module-linkages\index{linkage}: even the references to assembly-language
                   1266: primitives should not be ``hard-wired'' addresses in code objects.
                   1267: If this were done, then all ML code would have to be re-compiled whenever
                   1268: the runtime system was re-compiled.
                   1269: 
                   1270: Instead, we just treat the assembly-language functions as a special
                   1271: record object containing nine elements.  Other modules are parametrized
                   1272: by this \verb"Assembly" module just as if it were an ordinary one.
                   1273: All the assembly-language functions follow the ML calling conventions.
                   1274: 
                   1275: Most of the C functions are to provide access to system calls with
                   1276: hard-to-manage interfaces, like \verb"fork", etc.
                   1277: There is an added complication that the C and ML calling conventions
                   1278: don't match.
                   1279: The assembly-language function \verb"callc" arranges the arguments
                   1280: for an ML function to call a C function.  The details are uninteresting,
                   1281: but the point is that absolute references are again avoided.
                   1282: 
                   1283: \begin{quotation}
                   1284: {\small The file \verb"boot/assembly.sig" contains the signature
                   1285: \verb"Assembly", approximately as in figure~\ref{assembly}.
                   1286: \begin{figure}
                   1287: \label{assembly}
                   1288: \begin{verbatim}
                   1289: signature ASSEMBLY =
                   1290:   sig
                   1291:       datatype datalist = DATANIL | DATACONS of (string * string * datalist)
                   1292:       type func
                   1293:       datatype funclist = FUNC of (func * string * funclist)
                   1294:       type object
                   1295:       structure A : sig
                   1296:         val array : int * 'a -> 'a array
                   1297:         val callc : 'b (* func*)  * 'a -> 'c
                   1298:         val create_b : int -> string
                   1299:         val create_s : int -> string
                   1300:         val floor : real -> int
                   1301:         val logb : real -> int
                   1302:         val scalb : real * int -> real
                   1303:         val syscall : int * string list * int -> int
                   1304:       end
                   1305:       exception Div
                   1306:       exception Float of string
                   1307:       exception Interrupt
                   1308:       exception Overflow
                   1309:       exception SystemCall of string
                   1310:       exception UnboundTable
                   1311:       val array0 : 'a array
                   1312:       val bytearray0 : string
                   1313:       val collected : int ref
                   1314:       val collectedfrom : int ref
                   1315:       val current : string ref
                   1316:       val datalist : datalist
                   1317:       val external : funclist
                   1318:       val gcmessages : int ref
                   1319:       val gcprof : string ref
                   1320:       val majorcollections : int ref
                   1321:       val minorcollections : int ref
                   1322:       val opsys : int   (* 1 = vax bsd ultrix, 4.2, 4.3
                   1323:                            2 = sunos 3.0, 4.0 
                   1324:                            3 = vax v9 (bell labs) *)
                   1325:       val pstruct : object ref
                   1326:       val ratio : int ref
                   1327: end
                   1328: \end{verbatim}
                   1329: \caption{The {\tt ASSEMBLY} signature}
                   1330: \end{figure}
                   1331: \index{Assembly structure}
                   1332: The substructure \verb"A" is the machine-dependent part (implemented
                   1333: in assembly language), and the other components (exceptions, constants,
                   1334: etc.) are just data structures that can be described machine-independently
                   1335: in C.  
                   1336: \verb"A" is arranged as an ML record 
                   1337: at the label \verb"runvec"\index{runvec}
                   1338: in the file \verb"VAX.prim.s", \verb"M68.prim.s",
                   1339: etc.  The machine-independent part
                   1340: is in \verb"cstruct.c"\index{cstruct.c}.
                   1341: 
                   1342: The exceptions in this structure are just those that can be raised
                   1343: from C or assembly language.  Exceptions as elaborated by the compiler
                   1344: look like references to strings, and in \verb"cstruct.c" are implemented
                   1345: as initialized structures to mimic this
                   1346: with all the appropriate descriptors.
                   1347: 
                   1348: The values \verb"array0" and \verb"bytearray0" are just the array and
                   1349: \index{array0}\index{array of length 0}
                   1350: \index{bytearray0}
                   1351: byte-array of length 0; this is necessary because objects of length 0
                   1352: are not permitted in the garbage-collectible region.
                   1353: 
                   1354: \index{collected}\index{collectedfrom}
                   1355: \index{majorcollections}\index{minorcollections}\index{ratio}
                   1356: The references \verb"collected", \verb"collectedfrom",
                   1357: \verb"majorcollections", and \verb"minorcollections" are updated by
                   1358: the garbage collector with performance information.  The \verb"ratio"
                   1359: variable can be set from ML to tell the garbage collector what
                   1360: ratio to maintain between heap size and amount of live data.
                   1361: 
                   1362: \index{current}\index{profiling}
                   1363: The variable \verb"current" is used in execution profiling\cite{appel88:prof};
                   1364: it is a pointer
                   1365: to  an array of length 2 of ML integers.  When profiling is enabled,
                   1366: a timer interrupt will periodically
                   1367: cause current[1] to be incremented.
                   1368: 
                   1369: \index{pstruct}
                   1370: \index{loader}
                   1371: \index{Initial}
                   1372: The \verb"pstruct" is a pointer to the \verb"Initial" (pervasive)
                   1373: structure.  The ML loader (\verb"boot/loader.sml") needs this
                   1374: because other modules may reference the structure \verb"Initial".
                   1375: 
                   1376: \index{datalist}
                   1377: The \verb"datalist" is used for linkage to sharable compiled ML code.
                   1378: In vanilla Unix, such code must be link-loaded into the text segment
                   1379: of the sml executable file, and the \verb"datalist" provides the ML
                   1380: loader a way to get to it.  Each element of the data list contains
                   1381: two strings (the name of the module, and the executable code), along
                   1382: with a link to the next element.
                   1383: 
                   1384: \index{funclist}
                   1385: The \verb"funclist" is similarly used for linkage to functions implemented
                   1386: in C.  These functions could be described in the Assembly signature
                   1387: but this requires recompilation whenever a new function is added.
                   1388: The funclist is terminated by a function whose name is "xxxx".  All functions
                   1389: in the list must have a name of exactly 4 characters.
                   1390: All C-functions must take exactly one argument in ML format and return
                   1391: a result in ML format.  The functions implemented in C at this writing are:
                   1392: \begin{itemize}
                   1393: \item{\bf fion\index{fion}} tells how many characters may be read from an open
                   1394: file-descriptor without blocking.
                   1395: \item{\bf fork\index{fork}} does a Unix process fork.
                   1396: \item{\bf prof\index{prof}} is obsolete.
                   1397: \item{\bf syst\index{syst}} executes a shell command as a sub-process.
                   1398: \item{\bf time\index{time}} gets execution-time information from the operating system
                   1399: and the garbage collector.
                   1400: \item{\bf argv\index{argv}} gets the command-line arguments of the executing program
                   1401: (as a list of strings).
                   1402: \item{\bf envi\index{envi}} gets the Unix environment\index{environment}
                   1403: string (as a list of strings).
                   1404: \item{\bf blas\index{blas}} does a structured\index{structured write} write (as described section~\ref{structured}).
                   1405: \item{\bf salb\index{salb}} does a structured read.
                   1406: \end{itemize}
                   1407: 
                   1408: These functions should be careful not to allocate memory using \verb"malloc".
                   1409: \index{malloc}
                   1410: 
                   1411: All of these functions are called from the ML ``thread'' (see 
                   1412: section~\ref{fast}).
                   1413: Some C functions may require the (temporary) suspension of the ML thread,
                   1414: e.g. to do a garbage collection for a structured write.
                   1415: By setting the global variable \verb"cause"\index{cause} to a nonzero
                   1416: value and then returning, a function may cause control to be grabbed
                   1417: immediately thereafter in the thread\index{thread} controller.
                   1418: }
                   1419: \end{quotation}
                   1420: 
                   1421: \section{Access to operating system services}
                   1422: \label{access}
                   1423: 
                   1424: A typical operating system\index{operating system}
                   1425: provides many services, each with one or
                   1426: more operations (``system calls'').  A very early version of Standard ML of
                   1427: New Jersey made little use of these services, so it sufficed to write
                   1428: an assembly-language interface to each desired system call, and then
                   1429: call the assembly-language functions from ML programs.  As the ML environment
                   1430: grew more sophisticated, it needed more operating system services, so the
                   1431: number of assembly-language interface functions grew.  This eventually
                   1432: became intolerable.
                   1433: 
                   1434: Now there is just one assembly-language interface function, \verb"syscall",
                   1435: \index{syscall}
                   1436: that takes a list of arguments and a system-call number.  This function
                   1437: pushes the arguments onto the stack and makes the system call.  The ML standard
                   1438: library hides the \verb"syscall" function behind a variety of typechecked
                   1439: functions.  Implementing these functions in ML rather than assembly language
                   1440: is better for two reasons: ML is safer and easier to program in, and (more
                   1441: important) the ML code doesn't need to be rewritten for each target
                   1442: architecture.
                   1443: 
                   1444: \begin{quotation}
                   1445: {\small All Unix system calls return integer values.  These are
                   1446: converted into ML integers (by shifting and incrementing) and returned
                   1447: as the result of \verb"syscall".  Many system calls take parameters
                   1448: by reference that they stuff results into.  These results are (of course)
                   1449: not appropriately tagged for ML.  A byte-array (of the
                   1450: appropriate length) is used 
                   1451: as the actual parameter for such an argument, and then
                   1452: the results can be extracted from the byte-array after the system call.
                   1453: 
                   1454: Integer arguments to system calls are shifted right by \verb"syscall"
                   1455: to convert them into their (untagged) representation.
                   1456: 
                   1457: String and byte-array arguments are left alone, yielding their natural
                   1458: representation.  However, strings\index{string}
                   1459: in C (and Unix) are terminated
                   1460: by a null character, whereas strings in ML have a length prefix.
                   1461: The prefix is ignored by the kernel, since it occurs at offset -4 (bytes).
                   1462: To achieve null termination of strings, you may take the actual ML argument
                   1463: and concatenate the string \verb|"\000\000"| of two zero characters (two
                   1464: characters are required because of the unboxed representation of
                   1465: single-character strings, and the possibility that the actual argument
                   1466: is an empty string).
                   1467: }
                   1468: \end{quotation}
                   1469: \section{Constructor representations}
                   1470: \label{constructor}
                   1471: Those unfamiliar with ML might wish to skip this section.
                   1472: 
                   1473: Some of the constructors\index{constructor}
                   1474: in an ML datatype\index{datatype} may not carry values.
                   1475: For example, in the type
                   1476: \begin{verbatim}
                   1477: datatype 'a option = NONE | SOME of 'a
                   1478: \end{verbatim}
                   1479: the \verb"SOME" constructor carries value of type \verb"'a"
                   1480: and the
                   1481: \verb"NONE" constructor carries no value.  Constructors that carry no value
                   1482: are represented not as two-element records, but as (unboxed) small integers.
                   1483: In this case, \verb"NONE"
                   1484: is represented by the integer 0 (with a low-order 1 tag bit),
                   1485: and \verb"SOME(x)" is represented by a two-element record containing the value
                   1486: \verb"x" and a small integer representing the \verb"SOME" constructor.
                   1487: The ML program can distinguish which constructor has been used by
                   1488: testing for ``boxity:''\index{boxity}
                   1489:  \verb"NONE" is an unboxed value\index{unboxed value}, \verb"SOME(x)"
                   1490: is a boxed value\index{boxed value},
                   1491: and the low-order bit will distinguish them.
                   1492: 
                   1493: Finally, in the case that there is only one value-carrying constructor,
                   1494: and the value carried always has a boxed representation, the extra
                   1495: indirection record can be eliminated.  Thus, for the {\bf list} datatype
                   1496: \begin{verbatim}
                   1497: datatype 'a list = NIL | CONS of ('a * 'a list)
                   1498: \end{verbatim}
                   1499: the constructor \verb"NIL" can be represented as the unboxed integer 0, the
                   1500: constructor \verb"CONS(a,b)" can be represented as a two-element record
                   1501: containing \verb"a" and \verb"b".  Again, the low-order ``boxity'' bit
                   1502: distinguishes the constructors.  If there had been several value-carrying
                   1503: constructors, then the boxity bit alone could not distinguish them,
                   1504: and an indirection record (containing the carried value and a small
                   1505: integer denoting the constructor) would be required.
                   1506: 
                   1507: It is tempting to play more elaborate tricks with the representation of
                   1508: constructors.  For example, the {\em option} datatype described above
                   1509: might not seem to need an extra indirection; surely the boxity bit
                   1510: should be enough to distinguish between \verb"NONE" and \verb"SOME(x)"?
                   1511: The problem is that the value \verb"x" might itself be either boxed or
                   1512: unboxed; the indirection guarantees that \verb"SOME(x)" will indeed be boxed.
                   1513: 
                   1514: Similarly, it is tempting to use a clever representation for the datatype
                   1515: \begin{verbatim}
                   1516: type t = a * b
                   1517: datatype atom = NUMBER of int | RECORD of t
                   1518: \end{verbatim}
                   1519: Here, the values carried by the NUMBER constructor are always unboxed, and
                   1520: the values carried by the RECORD constructor are always boxed; the
                   1521: boxity can serve to distinguish them.
                   1522: 
                   1523: The problem with these schemes is caused by the abstraction provided by
                   1524: ML {\em functors\index{functor}}:
                   1525: a functor might take the atom datatype as a parameter
                   1526: with some of its structure hidden:
                   1527: \begin{verbatim}
                   1528: functor F(sig type t
                   1529:               datatype atom = NUMBER of int | RECORD of t
                   1530:           end)
                   1531: \end{verbatim}
                   1532: and here it is not at all clear that \verb"t" is always boxed.
                   1533: (Unfortunately, this problem applies to the ``clever'' representation of the {\em list}
                   1534: datatype as well, but we have chosen to ignore it there---we can
                   1535: detect the (rare) problems with partially-abstracted lists
                   1536: as functor parameters and give error messages, but we didn't want
                   1537: to use an extra indirection in the representation of lists.)
                   1538: The problem of functor/datatype interaction 
                   1539: could be considered a defect in the language design.
                   1540: 
                   1541: \begin{quotation}
                   1542: {\small Note that the constructor tag is element 1 of the record and
                   1543: the value is element 0.  There's no particular reason for this; it's
                   1544: probably a bad idea.  For exception constructors, the constructor
                   1545: tag is not a small integer, it is a \verb"string ref".  A ref cell is
                   1546: used instead of just a string so that (generative)
                   1547: exceptions may be compared
                   1548: for equality in pattern-matching exception\index{exception} constructors.
                   1549: }
                   1550: \end{quotation}
                   1551: \section{Suspending a process}
                   1552: \label{suspending}
                   1553: \index{exportML}
                   1554: Having designed a simple layout of runtime objects, we can now implement
                   1555: easily a variety of runtime services.
                   1556: 
                   1557: A convenient feature of a programming environment is the ability to save
                   1558: the current state of the computation in a file, so that by executing
                   1559: that file on a later date the computation can be resumed.  In an interactive
                   1560: system, one might wish to compile and load some programs, and then 
                   1561: ``save the world'' so that in the next session these programs don't
                   1562: have to be re-loaded.  This feature could also be useful for program
                   1563: checkpoints.
                   1564: 
                   1565: It's not too hard to suspend a process in this way.  It's even possible
                   1566: to make the resulting
                   1567: file an ordinary executable file.  An executable file
                   1568: contains a header, a text segment, and a data segment.  The (read-only)
                   1569: text segment
                   1570: of the saved file will be identical to the text segment of the currently
                   1571: executing process, so a single \verb"write" system call will serve to
                   1572: write it to the file.  The data segment of the file will consist of the
                   1573: original data segment of the process plus any newly allocated heap memory;
                   1574: again, one or two \verb"write"s will put it neatly into the file.
                   1575: (It is useful to do a garbage collection immediately before saving,
                   1576: to minimize the size of the file.)
                   1577: Finally, the header can be synthesized appropriately.
                   1578: 
                   1579: An early version of Standard ML of New Jersey used a runtime 
                   1580: stack\index{stack}, and
                   1581: we had no end of trouble getting the stack saved correctly.  A Unix executable
                   1582: file doesn't have a stack segment, so the stack had to be copied into the
                   1583: heap before saving.  And after restoring and copying back to the stack
                   1584: segment, we found that we had creamed the (new process's) command-line
                   1585: arguments, etc.  When we eliminated the runtime stack, these annoyances
                   1586: ceased.
                   1587: 
                   1588: \begin{quotation}
                   1589: {\small 
                   1590: \index{magic number}
                   1591: The resulting executable file has a ZMAGIC magic number on
                   1592: Berkeley Unix, and NMAGIC on SunOs 4.0.  This is because ZMAGIC
                   1593: implies dynamic loading of shared libraries on SunOs 4.0, which complicates
                   1594: the address map.  Creating executable files is annoyingly difficult
                   1595: in operating systems like SunOs and Mach, whose notion of a process
                   1596: address map is not as simple as in Berkeley Unix.
                   1597: }
                   1598: \end{quotation}
                   1599: \section{Structured writes and reads}
                   1600: \label{structured}
                   1601: 
                   1602: \index{structured write}
                   1603: It's often necessary to take a data structure (with many records, pointers,
                   1604: strings, etc.) and write it to a file in binary form so it can be read
                   1605: back in quickly.  It's always possible to do this in an {\em ad hoc} way
                   1606: for each different kind of data structure that has to be written, but
                   1607: it's sometimes difficult to get the pointer-sharing relations right.
                   1608: Since the runtime data format has enough information for the garbage-collector
                   1609: to traverse the structure, then a writer/reader must be able to traverse
                   1610: it as well.
                   1611: 
                   1612: In fact, we can view the saving of such a structure
                   1613: as just a garbage collection.
                   1614: The argument to the \verb"structured-write" function is just a root pointer,
                   1615: and \verb"structured-write" must traverse all the data accessible from this
                   1616: root.  In doing so, it must copy the data to a file, being careful
                   1617: to make only one copy of each record.
                   1618: 
                   1619: \index{garbage collection}
                   1620: We can, in fact, just invoke the \verb"gc" procedure, giving as the
                   1621: \verb"roots" argument just a singleton vector---the pointer that
                   1622: was handed to \verb"structured-write".  Then, the resulting {\em tospace}
                   1623: will be exactly the desired contents of the file, which can be written
                   1624: out in a single operation.
                   1625: 
                   1626: At this point, however, memory is in a bit of a mess; some objects have
                   1627: been forwarded and others have not.  One solution is just to finish the
                   1628: garbage collection: each of the ``normal'' roots of the data is
                   1629: forwarded, and then (again) all the remaining unforwarded pointers in
                   1630: {\em tospace} are forwarded.  Then memory is again consistent, and execution
                   1631: may continue.
                   1632: 
                   1633: This is unattractive because the time required to do a structured write
                   1634: is now proportional to the amount of all live data, not proportional
                   1635: to the amount written.  So a different solution is better:  keep
                   1636: a separate list of
                   1637: repair\index{repair}
                   1638: information about the {\em fromspace} objects that were overwritten with
                   1639: forwarding pointers.  Whenever a {\em fromspace} object is forwarded,
                   1640: two words of information must be kept about it:  the location of the
                   1641: object itself, and
                   1642: the previous value
                   1643: of its first word (that was overwritten).  Then, repair is easy;
                   1644: the list of repair information can be traversed, and each {\em fromspace}
                   1645: object is restored to its original state.
                   1646: The descriptor can be recovered by copying the descriptor of the
                   1647: {\em tospace} copy (accessed via the forwarding pointer), and the first
                   1648: word (which had been overwritten by the forwarding pointer) can then
                   1649: be recovered from the repair information.
                   1650: Saving the repair information is fast, and performing the repairs is also
                   1651: fast.  The time to do a structured write is now proportional (with a small
                   1652: constant of proportionality!) to the amount written.
                   1653: 
                   1654: There's one last problem.  Where do we keep the repair information, and
                   1655: what if it grows too large?  The repair information can be kept at the
                   1656: far end of {\em tospace}.  Then, as long as the amount written is fairly small,
                   1657: we'll never run into it.  And if we do run into it, that means that
                   1658: the amount written must be at least half the size of {\em tospace} (where {\em tospace}
                   1659: is always at least the size of the total accessible data).  What we can do
                   1660: in this case is just to toss away the repair information, and go ahead
                   1661: with a garbage collection!
                   1662: 
                   1663: \begin{quotation}
                   1664: {\small In a multi-generation system (as in SML-NJ's runtime system),
                   1665: we do a (cheap) minor collection before the structured write; the collection
                   1666: referred to in the paragraphs above is a major collection.
                   1667: }
                   1668: \end{quotation}
                   1669: \section{Buffered input/output}
                   1670: \label{buffered}
                   1671: 
                   1672: Some operating systems (like Unix) don't provide buffered\index{buffered I/O}
                   1673: input and output
                   1674: as a primitive, and even in operating systems that do, the overhead
                   1675: of one system call per character may be too high.  What the runtime
                   1676: system can do is provide this service for user programs.
                   1677: 
                   1678: The Unix/C\index{C} standard library\cite{kernighan78} demonstrates that
                   1679: \index{standard library}
                   1680: the runtime system need not provide this service; instead, library
                   1681: routines can implement buffered I/O.  We have taken the same approach
                   1682: in our system; the buffered I/O functions are implemented in ML as library
                   1683: functions, and the runtime system is completely ignorant of buffered I/O.
                   1684: 
                   1685: 
                   1686: \section{Execution profiling}
                   1687: \label{execution}
                   1688: 
                   1689: \index{profiling}
                   1690: In large software systems it can become difficult to tell where
                   1691: the inefficiencies are.  Execution profilers can help with this process.
                   1692: A typical profiler might provided information about the amount of time
                   1693: spent in each procedure and the number of times it was called.
                   1694: 
                   1695: Standard ML of New Jersey has an execution profiler\cite{appel88:prof},
                   1696: which will only briefly be described here.  Call counts are handled
                   1697: completely independently of the runtime system; the compiler inserts
                   1698: program variables that are incremented on entry to procedures, and the
                   1699: profiler examines these variables to generate call-count information.
                   1700: 
                   1701: Execution-time estimates are more difficult.  The Unix \verb"prof" system
                   1702: call starts a timer interrupt in the operating system that periodically
                   1703: samples the program's program counter.  An affine function is applied
                   1704: to the PC and the corresponding element of a ``histogram'' array is 
                   1705: incremented.  Then the histogram can be compared against the object
                   1706: file to see how many interrupts occurred in each procedure, which gives
                   1707: a good estimate of actual time spend in each procedure.
                   1708: 
                   1709: With a garbage collector that moves procedures around in memory, this
                   1710: method becomes unwieldy.  Instead, the runtime system has a variable
                   1711: \verb"current" that points at a sample-count cell for the currently
                   1712: executing procedure.  Whenever (profiled) code begins executing a procedure
                   1713: (or returns from a call), it assigns the address of the new procedure's 
                   1714: \index{current}
                   1715: sample-count variable into \verb"current" (the compiler inserts code to do
                   1716: this).  The timer interrupt, instead
                   1717: of examining the PC, just increments the cell pointed to by \verb"current".
                   1718: 
                   1719: This very simple mechanism does not greatly complicate the runtime system,
                   1720: but provides profiling information with a relatively low overhead.
                   1721: 
                   1722: \section{Debugging}
                   1723: \label{debugging}
                   1724: 
                   1725: We are currently developing a high-level
                   1726: debugger\index{debugger} based on the same principles as our
                   1727: profiler---namely, minimal interaction with the runtime system.
                   1728: The work is still in its early stages, but we believe that
                   1729: it is possible to make a debugger this way, and that the runtime system
                   1730: won't have to grow much in size to support it.  A detailed description
                   1731: of our plans is beyond the scope of this paper.
                   1732: 
                   1733: \section{Handling of interrupts and exceptions}
                   1734: \label{handling}
                   1735: 
                   1736: The ML language has an exception\index{exception}-handling
                   1737:  mechanism with dynamically-nested
                   1738: handlers.  Some exceptions are raised by the program itself, some are related
                   1739: to synchronous hardware exceptions (like floating point errors), and
                   1740: some are generated asynchronously (like the \verb"Interrupt" exception
                   1741: raised when the user presses the interrupt key).
                   1742: 
                   1743: The ML program has a ``current-exception-handler'' register that just
                   1744: contains a pointer to a continuation\index{continuation}
                   1745: (i.e. a closure).
                   1746: Raising an exception just
                   1747: corresponds to invoking this continuation with the exception-object
                   1748: as an argument.  Dynamic nesting of handlers is handled completely by the
                   1749: compiler: it generates code that saves the previous handler (in the
                   1750: new handler's closure) and restores it upon exit from the scope of the
                   1751: new handler.
                   1752: 
                   1753: This makes it particularly easy for the programming language's exception
                   1754: mechanism to be attached to the hardware's notion of fault or interrupt.
                   1755: When a fault or interrupt\index{interrupt}
                   1756: occurs, the operating system calls a special
                   1757: (assembly-language)
                   1758: function in the runtime system.  This function then determines what
                   1759: caused the fault, clears it if necessary, then {\em does not} return to the
                   1760: operating system but just invokes the current-exception-handler continuation
                   1761: on the appropriate exception object.
                   1762: 
                   1763: \section{Fancier garbage collectors and object boundaries}
                   1764: \label{fancier}
                   1765: 
                   1766: Our current system uses a good, simple generational garbage
                   1767: collector\cite{appel89:sggc}.  Generation garbage
                   1768: collection\cite{lieberman83}\cite{ungar84} is based on the observations
                   1769: that newer records tend to become garbage more quickly, so that the
                   1770: collector should concentrate its effort on the newer records; and that
                   1771: newer records tend to point to older records, but not vice versa, so that
                   1772: the older records need not be searched for roots into the newer area.
                   1773: 
                   1774: Sometimes, however, older records are modified (by a \verb"rplaca" operation
                   1775: in Lisp or an assignment operation in ML)
                   1776: to point to newer ones.
                   1777: Generational garbage collectors require
                   1778: the maintenance of a list of all modified records.
                   1779: To ensure portability,
                   1780: we use a very simple scheme to maintain this list: the compiler
                   1781: inserts instructions after each \verb"update" operation to put the
                   1782: address of the updated cell on a special list for the garbage collector
                   1783: to use.
                   1784: 
                   1785: \index{page fault}\index{virtual memory}\index{garbage collector}
                   1786: A clever trick is to use the virtual memory system in maintaining this list.
                   1787: It is only updates to records in older generations that must be remembered;
                   1788: we can just make all the memory occupied  by older generations read-only,
                   1789: and an update will cause a page fault.  The fault-handler can then put the
                   1790: page on a list of updated pages, and mark the page as writeable.
                   1791: During garbage collection, the entire page can be searched for
                   1792: roots into the newer region.
                   1793: 
                   1794: A different virtual-memory trick can be used to achieve concurrent garbage
                   1795: collection\cite{appel88:gc}.  By making the pages of the {\em tospace} inaccessible
                   1796: until they have been scanned and forwarded, we can allow the ML program
                   1797: to continue executing even while the garbage collector is running.
                   1798: An access to an unscanned page of {\em tospace} causes a page fault, and the
                   1799: fault handler scans that page and makes it accessible.
                   1800: Then the latency is bounded by the product of the page size and the
                   1801: record size (in practice, to just a few milliseconds).
                   1802: This is very desirable in a real-time or interactive system.
                   1803: 
                   1804: Both of these algorithms require that pages be scanned (and forwarded)
                   1805: out of order.  When objects cross page boundaries, it is difficult
                   1806: to know where the first object on a page begins, so that its descriptor
                   1807: may be found.  There are solutions to this problem that involve keeping
                   1808: track of objects that cross page boundaries\cite{appel88:gc}, but the
                   1809: SML-NJ runtime system has a runtime data format that allows a very
                   1810: simple solution.
                   1811: 
                   1812: Suppose we need to scan a particular page (to forward all its pointers).
                   1813: When we start at the beginning of the page, we don't know whether
                   1814: we are in the middle of a record or of a string; and we don't know
                   1815: where the descriptor for the object is.
                   1816: But if we knew that the page contained only records, then we wouldn't need
                   1817: to find the descriptor;
                   1818: the important thing about records is that their boundaries can be ignored.
                   1819: If a page consists of the last part of one record, followed by several
                   1820: complete records, followed by the first part of the last record, the
                   1821: pointers in all of those records can be scanned and forwarded
                   1822: without regard to the boundaries.  Since the descriptors are unboxed
                   1823: (they look like integers), they won't be touched; and all the other
                   1824: fields are tagged pointers or integers which will be forwarded (or not)
                   1825: appropriately.  (See figure~\ref{boundaries} for an illustration.)
                   1826: \begin{figure}[htbp]
                   1827: \label{boundaries}
                   1828: \begin{verbatim}
                   1829:      +------+--object boundary
                   1830:      | desc |
                   1831:      +------+
                   1832:      |  o------------>
                   1833: -----+------+-----page boundary
                   1834:      |  5   |
                   1835:      +------+--object boundary
                   1836:      | desc |
                   1837:      +------+
                   1838:      |  o------------>
                   1839:      +------+
                   1840:      |  o------------>
                   1841:      +------+--object boundary
                   1842:      | desc |
                   1843:      +------+
                   1844:      |  7   |
                   1845:      +------+
                   1846:      |  o------------>
                   1847: -----+------+-----page boundary
                   1848:      |  6   |
                   1849:      +------+
                   1850:      |  o------------>
                   1851:      +------+--object boundary
                   1852: \end{verbatim}
                   1853: \caption{Object boundaries may be ignored by the forwarding algorithm}
                   1854: \end{figure}
                   1855: 
                   1856: And if we knew that the page contained only strings, then it wouldn't
                   1857: need to be scanned at all, since strings contain no pointers.
                   1858: 
                   1859: Thus, a simple arrangement that's useful when pages need to be scanned
                   1860: in arbitrary order is
                   1861: to segregate pointer-containing objects (records and arrays) from
                   1862: non-pointer-containing objects (strings and byte-arrays).  This isn't necessary
                   1863: for the newly-allocated objects; the compiler can continue to maintain
                   1864: just one register that points at the beginning of the free space.  But
                   1865: the copying garbage collector (the functions \verb"gc" and \verb"forward")
                   1866: must copy records and strings into different parts of memory.
                   1867: This greatly simplifies the implementation of page-based garbage collection
                   1868: algorithms.
                   1869: 
                   1870: \section{Size of the SML-NJ runtime system}
                   1871: \label{size}
                   1872: 
                   1873: Our runtime system is implemented in 1548 lines of C and 232 lines of
                   1874: assembly language (the assembly language must be duplicated for each
                   1875: target architecture).  This is small indeed, considering the functionality
                   1876: of our Standard ML system.  We have made it small by moving as much as possible
                   1877: out of the runtime system, and by having a simple format for ML data.
                   1878: 
                   1879: This 1780 lines of code can be divided approximately as follows:
                   1880: \begin{enumerate}
                   1881: \item[291] lines of C to initialize and link the Standard ML library
                   1882: and loader, which then loads the rest of the system.
                   1883: {\tiny (run.c)}
                   1884: \item[299] lines to implement the copying garbage collector with repairs.
                   1885: {\tiny (gc.c)}
                   1886: \item[55] lines of C to implement the simulated process-switching required
                   1887: for garbage-collector access to ML registers.
                   1888: {\tiny (callgc.c)}
                   1889: \item[400] lines for the management of generational garbage collection,
                   1890: and for deciding when to ask the operating system for more memory.
                   1891: {\tiny (callgc.c)}
                   1892: \item[304] lines to implement C-language functions and data structures
                   1893: used as primitives by the ML program.
                   1894: {\tiny (cstruct.c)}
                   1895: \item[117] lines of C to implement the suspending of process states into
                   1896: Unix executable files.
                   1897: {\tiny (export.c)}
                   1898: \item[72] lines of utility functions for C functions that manipulate ML
                   1899: objects.
                   1900: {\tiny (objects.c)}
                   1901: \item[38] lines of assembly language to handle simulated process switching.
                   1902: {\tiny (*.prim.s)}
                   1903: \item[195] lines of assembly language to implement primitive functions
                   1904: callable by ML programs.
                   1905: {\tiny (*.prim.s)}
                   1906: \end{enumerate}
                   1907: For comparison, the Icon runtime system\cite{griswold86} (excluding
                   1908: its interpreter) is 18,000 lines of C; the T3\cite{kranz86} runtime system is
                   1909: 1,900(?) lines of C and assembly language; the FranzLisp runtime system
                   1910: (including interpreter)
                   1911: is 19,500 lines of C; the Poly/ML runtime system is ??? lines of ???.
                   1912: 
                   1913: \section{Conclusion}
                   1914: \label{conclusion}
                   1915: 
                   1916: It's easy to implement a hi-tech runtime system if you do it cleanly.
                   1917: 
                   1918: \section{Acknowlegements}
                   1919: 
                   1920: David MacQueen, Trevor Jim, and Bruce Duba have all worked on the runtime
                   1921: system, and their assistance has been valuable.  Norman Ramsey provided
                   1922: many useful suggestions about the organization of this paper.
                   1923: 
                   1924: %\cleardoublepage
                   1925: \bibliographystyle{plain}
                   1926: \bibliography{appel}
                   1927: %\cleardoublepage
                   1928: %\printindex
                   1929: \end{document}
unix.superglobalmegacorp.com
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.