|
|
1.1 root 1: .TR 84-11
2: .DA "August 5, 1984"
3: .Gr
4: .TL
5: A Tour Through the C Implementation of Icon; Version 5.9
6: .AU
7: Ralph E. Griswold
8: .AU
9: Robert K. McConeghy
10: .AU
11: William H. Mitchell
12: .AB
13: This report documents the C implementation
14: of Version 5.9 of the Icon programming language.
15: This report concentrates on the
16: major parts of the system \(em
17: the translator, the linker, and the interpreter.
18: An additional section discusses how the implementation
19: can be modified for new language features.
20: .AE
21: .SH
22: Introduction
23: .PP
24: This report describes the C implementation of Version 5.9 of
25: the Icon programming language [1].
26: Most of the system is coded in C [2] and is designed to be run under \*U.
27: In addition to the C portion of the system, there is some assembly
28: language code.
29: To date, the C implementation has been adapted to the
30: PDP-11\u\(dg\d, VAX-11, and Onyx C8002.
31: .Un
32: .FS
33: \u\(dg\dPDP and VAX are trademarks of Digital Equipment Corporation.
34: .FE
35: This implementation is intended to be portable
36: to other computers running under UNIX,
37: but portability was not a primary design goal.
38: Reference 3 describes the process of transporting this implementation
39: and contains detailed descriptions of the assembly language routines for
40: the VAX implementation.
41: .PP
42: The implementation of the Icon system consists of three parts:
43: a translator, a linker,
44: and an interpreter. The interpreter contains a run-time system that
45: includes routines for the operations that are needed to execute an Icon program.
46: The translator converts an Icon source program into
47: an intermediate code, called
48: .I ucode .
49: The linker combines separately translated ucode files,
50: binds inter-procedure references,
51: and produces interpretable binary output, called \fIicode\fR.
52: .PP
53: The reference language for this report is Version 5.9 of Icon [4].
54: This report is intended to be used in conjunction with
55: the source listings for Version 5.9,
56: although a general overview of the system
57: can be obtained from this document alone.
58: .NH
59: The Translator
60: .PP
61: The Icon translator is written entirely in C
62: and consists of
63: 12 files of source code and 10 header files.
64: The translator builds a parse tree for each Icon procedure,
65: then traverses the tree to generate code.
66: Three of the 12 source files contain only data initialization
67: and are automatically generated from specification files.
68: In addition, the LALR(1) parser is automatically generated
69: by the \fIYacc\fR parser generator [5].
70: .PP
71: The ucode output from the translator consists of two files.
72: One file, with the suffix \fM.u1\fR, contains
73: intermediate code corresponding to the procedures in the program.
74: The second file, with the suffix \fM.u2\fR,
75: contains global symbol table information.
76: Both files are printable.
77: .PP
78: The following sections discuss the four parts of the translator:
79: the lexical analyzer,
80: the parser,
81: the code generator,
82: and the symbol table manager.
83: .NH 2
84: The Lexical Analyzer
85: .PP
86: The lexical analyzer reads the Icon source program,
87: breaks it into tokens,
88: and delivers the tokens to the parser as requested.
89: A token is the basic syntactic unit of the Icon language;
90: it may be an identifier, a literal, a reserved word,
91: or an operator (operators include punctuation).
92: .PP
93: The lexical analyzer consists of four source files:
94: \fMlex.c\fR,
95: \fMchar.c\fR,
96: \fMoptab.c\fR,
97: and
98: \fMtoktab.c\fR.
99: The latter two of these files contain operator and token tables, respectively,
100: and are automatically generated
101: from operator and token specification files,
102: described below.
103: The file
104: \fMchar.c\fR
105: contains character mapping tables and the file
106: \fMlex.c\fR
107: contains the lexical analyzer itself.
108: .PP
109: The parser requests a token by calling
110: \fMyylex\fR,
111: which finds the next token in the source program
112: and determines its token type and value.
113: The parser bases its moves on the token type:
114: if the token is an operator or reserved word,
115: the token type specifically identifies the operator or reserved word;
116: otherwise, the token type indicates one of the six \*(oqprimitive\*(cq types:
117: identifier, integer literal, real literal, string literal, cset literal, or end-of-file.
118: The token value is a leaf node of the parse tree,
119: which, for the primitive types,
120: contains the source program representation of the token.
121: The token value node also contains
122: the source-program line and column numbers where the token starts.
123: A pointer to this node is placed in the global variable
124: \fMyychar\fR,
125: and
126: \fMyylex\fR
127: returns the token type.
128: .PP
129: The lexical analyzer finds the next token
130: by skipping white space, including comments.
131: The first character of the new token indicates which of the classes it belongs to.
132: A letter or underscore begins an identifier or reserved word,
133: a digit begins an integer or real literal,
134: a double quote begins a string literal,
135: a single quote begins a cset literal,
136: and any other character is assumed to begin an operator.
137: An identifier or reserved word is completed
138: by gathering all subsequent letters, digits, and underscores.
139: The reserved word table is consulted to determine if the token
140: is an identifier or a reserved word.
141: Numeric literals are recognized by a finite-state automaton,
142: which distinguishes real from integer literals by the
143: presence of a decimal point or the letter \*(oqe\*(cq.
144: A quoted literal is completed by reading
145: until the opening delimiter is repeated,
146: converting escapes in the process and continuing to new lines as necessary.
147: A table-driven finite-state automaton, described below,
148: recognizes operators.
149: .PP
150: An important task of the lexical analyzer is semicolon insertion.
151: The grammar requires that semicolons separate expressions
152: in a compound expression or procedure body,
153: so they must be inserted into the token stream
154: where they are omitted in the source program.
155: This process is table driven.
156: Associated with each token type are two flags,
157: .I BEGINNER
158: and
159: .I ENDER .
160: The
161: .I BEGINNER
162: flag is true if a token may legally begin an expression
163: (i.e., if it may follow a semicolon).
164: Similarly, the
165: .I ENDER
166: flag is true if a token may legally end an expression
167: (i.e., if it may precede a semicolon).
168: When a newline appears between two tokens, the
169: .I ENDER
170: flag of the first is true, and the
171: .I BEGINNER
172: flag of the second is true,
173: then a semicolon is inserted between the two tokens.
174: .PP
175: The token table is initialized in the file
176: \fMtoktab.c\fR.
177: The table is divided into three sections:
178: primitive types, reserved words, and operators.
179: The primitive types are fixed in the first six
180: slots in the table,
181: and must not be changed,
182: since they are referenced directly from the code.
183: The reserved words follow
184: and must be in alphabetical order.
185: The operators follow in no special order.
186: The last entry merely marks the end of the table.
187: .PP
188: Also in
189: \fMtoktab.c\fR
190: is an index to reserved words.
191: To speed up the search for reserved words,
192: this table hashes the search
193: using the first letter as the hash value.
194: The reserved words that begin with that letter then are examined linearly.
195: .PP
196: The operator table, in
197: \fMoptab.c\fR,
198: describes a finite-state automaton that
199: recognizes each operator in the language.
200: Each state is represented by an array of structures.
201: Each structure in the array
202: corresponds to a transition on the input symbol.
203: The structure contains three fields:
204: an input symbol,
205: an action,
206: and a value used by the action.
207: The recognizer starts in state 0;
208: the current input symbol is the first character
209: of the operator.
210: In a given state with a given input symbol,
211: the recognizer searches the array associated with the current state
212: for an entry that matches the current input symbol.
213: Failing a match, the last entry of the array, with
214: the input symbol field of 0,
215: is used.
216: The recognizer then performs one of the following actions,
217: depending on the value of the action field:
218: .in .5i
219: .IP \(bu \w'\(bu'u+1m
220: goes to the new state indicated by the value field
221: and gets the next input character
222: .IP \(bu
223: issues an error
224: .IP \(bu
225: returns the value field as
226: a pointer to the token table entry for the operator
227: .IP \(bu
228: returns the value field,
229: but pushes the current input character back onto the input.
230: .in 0
231: .LP
232: The difference between the last two actions is that
233: some operators are recognized immediately (e.g., \*(oq\fM;\fR\*(cq),
234: while others are not recognized until the character
235: following the operator is read (e.g., \*(oq\fM=\fR\*(cq).
236: .PP
237: The token table
238: and operator table
239: are automatically constructed by the Icon program
240: \fMmktoktab.icn\fR.
241: This program reads the specification file
242: \fMtokens\fR
243: and builds the file
244: \fMtoktab.c\fR.
245: The file
246: \fMtokens\fR
247: contains a list of all the tokens,
248: their token types (given as defined constants),
249: and any associated flags.
250: This list is divided into the three sections detailed above.
251: The program then reads the specification file
252: \fMoptab\fR
253: and builds the file
254: \fMoptab.c\fR.
255: The former is a skeleton for the operator table;
256: it contains the state tables,
257: but the program fills in the pointers to the token table entries.
258: .NH 2
259: The Parser
260: .PP
261: The parser, in the file
262: \fMparse.c\fR,
263: is automatically generated by \fIYacc\fR.
264: The grammar and semantic actions are contained in the file
265: \fMicon.g\fR.
266: From these specifications,
267: \fIYacc\fR generates parser tables for an LALR(1) parser.
268: .PP
269: In addition to the grammar,
270: \fMicon.g\fR
271: contains
272: a list of all the token types in the language
273: and declarations necessary to the actions.
274: \fIYacc\fR assigns an integer value to each token type,
275: and generates define statements,
276: which are written to the file
277: \fMtoken.h\fR.
278: These defined constants are the token types
279: returned by the lexical analyzer.
280: .PP
281: The grammar is context-free,
282: with actions associated with most of the rules.
283: An action is invoked when the corresponding rule is reduced.
284: The actions perform two duties:
285: maintaining the symbol tables
286: and constructing the parse tree.
287: The parse tree is built from the bottom up \(em
288: the leaves are supplied by the lexical analyzer
289: and the actions build trees from the leaves and from smaller trees
290: with each reduction.
291: .PP
292: The parser requests tokens from the lexical analyzer,
293: building a parse tree until it reduces a procedure.
294: At this point, it passes the root of the parse tree
295: to the code generator.
296: Once the intermediate code has been generated,
297: the parse tree is discarded,
298: and a new tree is begun for the next procedure.
299: .PP
300: Record and global declarations
301: affect only the symbol table
302: and do not generate parse trees.
303: Files named in link directives produce link instructions in the ucode
304: output.
305: .PP
306: A complete parse tree is rooted at a
307: \f3proc\fR
308: node, which identifies the procedure
309: and points to the subtrees for the
310: \fMinitial\fR
311: clause (if any)
312: and the body of the procedure.
313: Each node in the parse tree represents a source program construction
314: or some implicit semantic action.
315: A node can contain up to six fields,
316: the first of which is the node type.
317: The second and third fields are always line and column numbers
318: that are used for error messages and tracing.
319: Any additional fields contain information about the construction,
320: and possibly pointers to subtrees.
321: Appendix A contains a description of all the node types.
322: .PP
323: The grammar, shown in Appendix B, has several ambiguities.
324: The well-known \*(oqdangling else\*(cq problem exists
325: not only in the
326: \fMif-then-else\fR
327: expression, but also in the
328: \fMwhile-do\fR,
329: \fMuntil-do\fR,
330: \fMevery-do\fR,
331: and \fMto-by\fR
332: expressions.
333: In each of these expressions, the last clause is optional,
334: so that when the parser sees an
335: \fMelse\fR,
336: for example,
337: it does not know whether to shift the token
338: (associating it with the most recent
339: \fMif\fR),
340: or to reduce the preceding
341: \fMif-then\fR
342: expression (leaving the else \*(oqdangling\*(cq).
343: The latter choice is obviously incorrect, since the
344: \fMelse\fR
345: would never be shifted, and \fIYacc\fR correctly resolves such conflicts
346: in favor of the shift.
347: Thus, each
348: \fMelse\fR
349: is paired with the most recent unpaired
350: \fMif\fR.
351: All the control structures (except
352: \fMcase\fR)
353: have an additional ambiguity:
354: they do not have a closed syntax,
355: yet they may appear in an expression at the highest precedence level.
356: For example, the expression
357: .DS
358: .tr -\-+\(pl*\(**
359: \fMx := y + if a = b then z else -z * 3
360: .tr --++**
361: .ft R
362: .DE
363: could parse in either of two ways:
364: .DS
365: .tr -\-+\(pl*\(**
366: \fMx := y + (if a = b then z else (-z * 3))
367: x := y + (if a = b then z else -z) * 3
368: .tr --++**
369: .ft R
370: .DE
371: This problem, too, is resolved in favor of the shift,
372: such that the first parse is always used.
373: Thus, in the absence of parentheses,
374: the entire expression to the right of a control structure
375: is part of the control structure.
376: .PP
377: Little attention has been paid to error recovery.
378: A few error productions have been placed in the grammar
379: to enable \fIYacc\fR to recover from syntax errors;
380: the technique for doing so is described by Aho and Johnson [6].
381: The parser is slightly modified by the editor script
382: \fMpscript\fR
383: so that the parser state is passed to the routine
384: \fMyyerror\fR.
385: This routine prints an error message from the file
386: \fMsynerr.h\fR
387: that is associated with the current parser state.
388: This error table currently is constructed by
389: hand from the \fMy.output\fR file obtained by
390: running \fIYacc\fR with the
391: \f3\-v\fR
392: option.
393: .NH 2
394: The Code Generator
395: .PP
396: The parser calls the code generator upon
397: recognition of each Icon procedure,
398: giving it the root of the parse tree.
399: The code generator traverses the parse tree recursively,
400: emitting ucode.
401: Appendix C contains a description of ucode.
402: .PP
403: The file
404: \fMcode.c\fR
405: contains both the tree node allocation and the code generation routines.
406: The included header file
407: \fMcode.h\fR
408: contains macros and definitions needed by the code generator,
409: while
410: \fMtree.h\fR
411: defines the tree nodes and the macros that allocate them.
412: The macros in
413: \fMtree.h\fR
414: provide the interface between the parser and the code generator.
415: .PP
416: The tree traversal routine,
417: \fMtraverse\fR,
418: is a recursive procedure with one argument,
419: a pointer to the root of a tree or subtree
420: for which code is to be generated.
421: The routine examines the type field of the root
422: and, through a switch statement,
423: generates a sequence of ucode instructions as determined by the type.
424: If the node has subtrees,
425: \fMtraverse\fR
426: calls itself recursively at the appropriate point
427: to generate code for the subtree.
428: For example, the code generated for a binary operator
429: first generates code for its two subexpressions,
430: then emits the code that calls the appropriate run-time library routine.
431: .PP
432: The returned value of the traversal routine is used for counting
433: elements of expression lists.
434: If the root of the tree being traversed is an
435: \f3elist\fR
436: (expression list) node,
437: \fMtraverse\fR
438: returns the sum of the returned values of its two subtrees.
439: Otherwise, it returns 1.
440: This count is used when generating code for
441: procedure calls and lists with explicit elements,
442: which need to know the number of arguments
443: to be pushed onto the stack.
444: .PP
445: When generating code for loops,
446: the code generator needs to save three pieces of information
447: for each nested loop:
448: the
449: .I "break label" ,
450: the
451: .I "next label" ,
452: and the expression nesting level.
453: This information is kept on the
454: .I "loop stack" .
455: The break label is a label placed just past the end of the loop;
456: it is the place where control is passed when the loop is finished.
457: The next label is placed near the end of the loop,
458: at a point where the next iteration of the loop can be started.
459: The code for
460: \fMbreak\fR
461: and
462: \fMnext\fR
463: expressions branches to these labels,
464: but in either case, any incomplete expression frames (see Section 3.2)
465: within the loop must first be popped from the stack.
466: The expression nesting level counts the number of currently active
467: expression frames within the current loop;
468: an
469: \f3unmark\fR
470: instruction is generated for that many expression frames
471: (less one for a
472: \fMnext\fR
473: expression).
474: .PP
475: The possibility of nested
476: \fMcase\fR
477: expressions requires that certain information be kept on a
478: .I "case stack" .
479: For each case expression,
480: the code generator allocates a label for the end of the expression
481: and pushes it onto the case stack.
482: When a
483: \fMdefault\fR
484: clause is encountered,
485: its subtree is placed on the top of the case stack to delay
486: code generation for it until the end of the \fMcase\fR expression.
487: .NH 2
488: The Symbol Table Manager
489: .PP
490: The symbol table manager consists of the symbol table data structures
491: and routines that operate upon these data structures.
492: The source code for the symbol table manager is contained in two files.
493: The file
494: \fMkeyword.c\fR
495: contains only the keyword name table and is automatically constructed from a
496: keyword specification file discussed below.
497: The remainder of the symbol table manager is located in the file
498: \fMsym.c\fR.
499: .PP
500: The symbol table manager operates with two logical data structures,
501: the symbol table proper and the string space.
502: When the lexical analyzer identifies a token as either
503: an identifier or a literal,
504: the lexical analyzer requests the symbol table manager
505: to enter the token into the string space.
506: The symbol table manager returns a pointer into
507: the string space for that string.
508: The lexical analyzer then places this pointer in the token value node.
509: To help keep the size of the string space small,
510: all entries are hashed,
511: and only one copy of any string is kept.
512: This has the added benefit that two strings can be compared
513: by checking only the pointers into the string space.
514: .PP
515: The parser determines the context of the token
516: and requests the symbol table manager
517: to enter the token into the symbol table proper.
518: It is the responsibility of the symbol table manager
519: to verify that the use of the token is consistent with prior use.
520: Appropriate diagnostics are issued if the use is inconsistent.
521: .PP
522: The symbol table proper is physically divided into three separate structures:
523: the
524: .I global ,
525: .I local ,
526: and
527: .I literal
528: tables.
529: Each of these tables is hashed,
530: using the pointer into the string space as the key.
531: Since this pointer is an offset into the string space,
532: hashing is simply and effectively performed by taking the rightmost
533: .I n
534: bits of the offset
535: (where
536: .if n .I n "" 2**
537: .if t 2\s-2\u\fIn\fR\d\s+2
538: is the size of the hash vector for the table).
539: .PP
540: The global table contains identifiers that have been declared as
541: globals, procedures, or records.
542: The local table holds all identifiers declared as locals,
543: formal parameters for procedure declarations, field names for record
544: declarations, and all other identifiers referenced in the procedure
545: (including those declared as global elsewhere).
546: The literal table contains entries for literal strings and csets, integers,
547: and floating-point constants.
548: .PP
549: Both the local and literal tables are associated with the current
550: procedure being parsed
551: and are written to the \fM.u1\fR file
552: when the procedure has been successfully parsed.
553: If a record declaration has been parsed, then the local table,
554: containing only the field name identifiers, is written to the
555: \&\fM.u2\fR file.
556: After all procedure, record, and global declarations in a Icon source
557: file have been parsed, the global table is written into the global
558: declarations file.
559: .PP
560: An entry into any of the three symbol table sections is a structure with
561: three fields:
562: a link,
563: a name,
564: and a flag.
565: The link field holds the pointer to the next entry in the same hash bucket.
566: The name is the pointer to the identifier or literal name in the string space.
567: The flag field contains the type
568: .I "formal parameter" , (
569: .I "static local" ,
570: .I "procedure name" ,
571: etc.) of the entry.
572: Global table entries have a fourth field,
573: an integer providing the number of formal parameters for a procedure declaration,
574: or the number of fields in a record declaration.
575: .PP
576: Lookup in the local and global tables is merely the process of
577: following a hash chain until an entry of the same name is found or until
578: the hash chain is exhausted.
579: If a previous entry is found, the flags of the existing and new entries
580: are compared, and diagnostics are printed if the use of the new entry
581: conflicts with the previous usage.
582: The new entry is ignored whenever such an inconsistency is found.
583: .PP
584: The literal table uses the same lookup procedure, except
585: the search down the hash chain stops when an entry is found
586: with the same textual form and flag fields.
587: Thus the string literal
588: \fM"123"\fR
589: and the integer literal
590: \fM123\fR
591: have separate entries in the literal table, even though
592: they have the same string representations.
593: A consequence of this technique is that the integer
594: literals
595: \fM123\fR
596: and
597: \fM0123\fR
598: have separate entries in the literal table,
599: even though they have the same numeric value.
600: Since most programmers use a reasonably consistent style
601: when expressing literals,
602: this technique usually does not produce many
603: duplicate constants.
604: .PP
605: .tr ##
606: A final task of the symbol table manager is
607: the identification of keyword names.
608: (Note that keywords are of the form \fM&\fIname\fR.)
609: The symbol table manager maintains a list of the legal keyword names
610: and, upon request, returns a numeric identification for a keyword name
611: to the parser.
612: An automatic procedure exists for creating the keyword name table:
613: the Icon program
614: \fMmkkeytab.icn\fR
615: reads the specification file
616: \fMkeywords\fR
617: and produces the keyword name table in
618: \fMkeyword.c\fR.
619: The file
620: \fMkeywords\fR
621: is simply a list of the keyword names and a numeric identification for each.
622: Since the number of keyword names is small, and only a few
623: references to keywords are typical in an Icon program, lookup in the
624: keyword name table is done using a linear search.
625: .PP
626: The sizes of the respective portions of the symbol table
627: may be altered with command line arguments to the Icon translator.
628: .NH
629: Linker
630: .PP
631: The Icon linker is written entirely in C. It consists of eight files of
632: source code and three header files.
633: The linker performs three tasks:
634: combining the global symbol tables from one or more
635: runs of the translator,
636: resolving undeclared identifiers,
637: and translating ucode to icode.
638: The resulting combined global symbol table is used
639: for determining the scope of undeclared identifiers
640: during the second task.
641: The second and third tasks are done during a single pass
642: over each intermediate code file.
643: A single file of assembly code is produced.
644: .PP
645: The symbol table module, in the file
646: \fMlsym.c\fR,
647: is similar to the symbol table module of the translator,
648: except that there is an additional table for storing
649: field names of records.
650: The input module, in the file
651: \fMllex.c\fR,
652: recognizes the instructions in both the global symbol table files
653: and the intermediate code files.
654: The global symbol tables are merged by the routine \fMglobals\fR in
655: \fMglob.c\fR,
656: and the intermediate code files are produced by the routines in
657: \fMlcode.c\fR.
658: Of the remaining source files,
659: \fMilink.c\fR
660: and
661: \fMlmem.c\fR
662: contain the main program, miscellaneous support routines,
663: and memory initialization.
664: The files
665: \fMbuiltin.c\fR
666: and
667: \fMopcode.c\fR
668: contain table initializations for the list of built-in procedures
669: (functions)
670: and the ucode operations, respectively.
671: .PP
672: The first phase of the linker reads
673: the global symbol table file from each translator run,
674: and entering all the global symbols into one combined table.
675: The format of a global symbol table file is described in Appendix C.
676: This phase also builds the record/field table that cross-references
677: records and field names,
678: and sets the trace flag for execution-time tracing
679: if any of the files being linked were translated with the
680: \f3\-t\fR
681: option.
682: .PP
683: As records are entered into the global symbol table
684: and the record/field table,
685: they are numbered, starting from 1.
686: These record numbers are used to index the record/field table
687: at run-time when referencing a field.
688: .PP
689: When the linker encounters a link instruction, the named file is added
690: to the end of a linked list of files to be linked. The list initially
691: consists of the files named as arguments. Names are not added to the
692: list if they are already on it.
693: .PP
694: The second phase reads each intermediate code file in sequence,
695: emitting icode as each procedure is encountered.
696: Appendix C describes the intermediate code.
697: The intermediate code contains a prologue for each procedure,
698: beginning with a
699: \f3proc\fR
700: opcode, followed by a series of
701: \f3loc\fR
702: opcodes describing the local symbol table, a series of
703: \f3con\fR
704: opcodes describing the constant table, and a
705: \f3declend\fR
706: opcode terminating the prologue.
707: The local symbol table contains not only local symbols,
708: but all identifiers referenced in the procedure \(em
709: global, local, or undeclared.
710: When an undeclared identifier is entered into the local symbol table,
711: its scope is resolved by the following steps:
712: .in .5i
713: .IP \(bu \w'\(bu'u+1m
714: if the identifier has been entered in the global symbol table,
715: it is entered into the local symbol table as a global identifier
716: .IP \(bu
717: if the identifier matches the name of a function,
718: it is entered into the local symbol table as a function
719: .IP \(bu
720: otherwise it is entered as a local identifier
721: and a warning is issued if the linker was run with the
722: \f3\-u\fR
723: option
724: .in 0
725: .LP
726: The constant table contains an entry for each literal used
727: in the procedure.
728: .PP
729: The linker outputs icode in several regions.
730: The first region contains constants, procedure blocks, and code
731: for the Icon procedures.
732: The next region contains the record/field table and procedure blocks
733: for record constructors.
734: The next four regions contain
735: the global identifiers, the names of the global identifiers,
736: the static identifiers, and the identifier table.
737: The icode is a sequence of instructions,
738: each with an opcode and, in some cases, operands. The sizes of
739: opcodes and operands depend on the machine architecture and the
740: implementor's judgement. On the
741: VAX-11, opcodes are one byte long and operands are four bytes long.
742: Most instructions correspond exactly to instructions
743: in the ucode that is output by the translator.
744: The opcode values are those used internally by the linker
745: (defined in the file
746: \fMlink/opcode.h\fR).
747: .PP
748: Fields are provided in the global symbol and literal tables
749: for associating a location with each entry.
750: As the prologue is being read, each cset, real, or long-integer
751: literal entered into the literal table is output immediately
752: and its location is stored in the literal table.
753: Thus, the locations of all constants are known before their reference.
754: .PP
755: The same is true of references to procedures,
756: since these references only occur in the initialization for global identifiers,
757: which is not output until all procedures have been output.
758: When the prologue for a procedure has been completely processed,
759: the procedure data block is output, and its location is noted in the
760: global symbol table.
761: .PP
762: References to program labels require backpatching,
763: since there often are forward references.
764: Because program label references are always local to the current procedure,
765: the linker buffers the output code for a procedure.
766: A table of values for all program labels is initialized to zero at
767: the beginning of each procedure.
768: When a label is referenced and its table entry is zero,
769: the location of the reference is negated and stored in the table entry
770: and a zero is output for the operand.
771: If a label's table entry is negative,
772: the location of the reference is negated and stored in the table entry
773: as before, but the previous value of the table entry is output for the operand.
774: This forms a linked list of references to the as-yet-undefined label.
775: When a label is defined, each reference on the linked list is replaced
776: with the correct value of the label.
777: .PP
778: References to global and static identifiers are determined at run-time.
779: The
780: \f3glob\fR
781: and
782: \f3static\fR
783: instructions have an integer operand referring to the identifier by
784: position in the global or static identifier region.
785: When one of these instructions is interpreted,
786: the actual address is calculated from the position and the
787: known address of the global or static identifier region.
788: References to functions are also resolved at run-time.
789: Each function is assigned an integer index
790: (its position in the table of functions in
791: \fMbuiltin.c\fR).
792: When the global identifier initialization for a function is output,
793: the negated index is output instead of an address.
794: The interpreter fills in the correct address during program initialization.
795: .PP
796: Once the prologue has been processed,
797: a procedure data block (see Section 3.1)
798: is emitted.
799: Opcodes following the prologue represent execution-time operations,
800: and cause code to be emitted.
801: .PP
802: The record/field table is a two-dimensional matrix,
803: first indexed by a field number assigned to each identifier that
804: is used as a field name,
805: next by a record number assigned to each record type.
806: The value at the selected position in the table is the
807: index of the field in a record of the given type, or \-1
808: if the given record type does not contain the given field.
809: .PP
810: The initial value for global and static identifiers
811: is the null value unless the global identifier is a procedure,
812: function, or record constructor,
813: in which case the initial value is a descriptor of type
814: .I procedure
815: pointing to the appropriate procedure data block.
816: The values output use the data representations described
817: in Section 3.1.
818: .PP
819: The names of global and static identifiers are output as
820: .I "string qualifier"
821: descriptors (see Section 3.1)
822: and are used by the function
823: \fMdisplay\fR.
824: All string qualifiers contained in the generated procedure data blocks
825: and global and static names point into the identifier table,
826: which is just a static string space for that purpose.
827: .NH
828: The Interpreter
829: .PP
830: The interpreter consists of an interpretive loop and a collection of
831: run-time routines that
832: collectively provide support for the execution
833: of an Icon program.
834: .ig
835: This library is searched by the loader for those
836: routines necessary for a particular Icon program.
837: The assembly code generated by the linker contains
838: subroutine calls to library routines to perform most
839: high-level operations where in-line code would be inappropriate.
840: An executable program is created by assembling
841: the linker output, then loading a startup routine and
842: the assembler output with the run-time library and the C library.
843: The startup routine, run-time library, and C library together
844: form the run-time system.
845: ..
846: .PP
847: Three directories contain routines relating directly to source-language
848: operations:
849: \fMfunctions\fR,
850: \fMoperators\fR,
851: and
852: \fMlib\fR.
853: The first two directories contain
854: one routine per function or operator, respectively.
855: The
856: \fMlib\fR
857: directory contains routines relating to Icon control structures.
858: A fourth directory,
859: \fMrt\fR,
860: contains routines for performing common operations
861: needed by many routines in the other three directories.
862: In particular,
863: \fMrt\fR
864: contains routines that handle storage allocation and reclamation,
865: type conversion, data comparison,
866: integer arithmetic with overflow checking,
867: generator suspension,
868: and tracing.
869: The directory \fMiconx\fR contains initialization and the interpreter proper.
870: .PP
871: In each of the four run-time directories, all of the object files are
872: archived in a \fMLib\fR file which is randomized to speed loading.
873: The \fMLib\fR files are loaded together with a startup routine and the
874: interpretive loop to produce the interpreter.
875: .PP
876: Most of the run-time system is coded in C,
877: but some of the routines are coded in assembly language.
878: The interpretive loop and startup routines are written in assembly language, as is
879: integer arithmetic with overflow checking
880: (C does not provide this),
881: as well as other routines concerned with stack management.
882: .PP
883: The interpreter is loaded with the run-time libraries and the C
884: library to form the program \fMiconx\fR, which interprets
885: icode.
886: .PP
887: Before the interpreter begins executing the Icon program,
888: it reads in the icode file generated by the linker.
889: The first eight words of this file contain header information
890: indicating the total size of the rest of the file,
891: the initial value of
892: \fM&trace\fR,
893: and the relative offsets from the beginning of the file to the
894: various regions.
895: These offsets are converted to actual addresses by adding the base
896: address of the icode buffer.
897: Several pointers in the icode must also be relocated.
898: The interpreter sweeps through the global identifiers,
899: looking for procedures, functions, and record constructors.
900: For each function, it supplies the address of the appropriate procedure block.
901: For each procedure, it relocates pointers
902: from its procedure block to the procedure entry point,
903: as well as to strings representing the procedure
904: and local identifier names in the identifier table.
905: For each record, it supplies the address of
906: \fMmkrec\fR,
907: the routine that constructs new records,
908: as the entry point field in the procedure block.
909: .PP
910: The interpreter then begins execution by invoking the value of the first global
911: identifier, which corresponds to the procedure
912: \fMmain\fR.
913: If there is no main procedure, the first global identifier has the
914: null value and a run-time error is reported.
915: The routine
916: \fMinvoke\fR
917: sets
918: the
919: \fIinterpreter program counter \fR(\fIipc\fR)
920: to the entry point, and branches to
921: \fMinterp\fR, which is contained in \fMiconx/interp.s\fR.
922: .PP
923: The routine
924: \fMinterp\fR
925: is the main interpreter loop.
926: It fetches the next opcode,
927: and branches to the appropriate processing routine through a jump table.
928: .NH 2
929: Data Representations
930: .PP
931: Icon has two elementary forms of data objects \(em values and variables.
932: Values often can be converted from one data type to another.
933: When this is done automatically, it is called
934: .I coercion .
935: There are three kinds of variables, each discussed below:
936: .I "natural variables" ,
937: .I "created variables" ,
938: and
939: .I "trapped variables" .
940: The process of obtaining the value referred to by a variable is called
941: .I dereferencing .
942: .PP
943: Every data object is represented by a two-word
944: .I descriptor ,
945: which may, depending on the type of the object,
946: contain a value or
947: refer to some other area of memory for the actual value.
948: The first word of the descriptor always indicates the data type,
949: and the second word either contains the value or a pointer to it.
950: There are six descriptor formats, shown in Appendix D:
951: .I null ,
952: .I "string qualifier" ,
953: .I "integer" ,
954: .I value ,
955: .I variable ,
956: and
957: .I "trapped variable" .
958: These formats are distinguished from one another by the three high-order
959: bits of the first word,
960: except that a
961: .I null
962: descriptor is distinguished from a
963: .I "string qualifier"
964: only by the contents of the second word.
965: Among
966: .I "integer" ,
967: .I value ,
968: and
969: .I "trapped variable"
970: descriptors, the low-order six bits of the first word
971: identify the type of object represented,
972: while the remaining high-order bits in the first word are flags that
973: classify the object (for example, whether the second word contains
974: a pointer
975: \(em historically, a \*(oqfloating address\*(cq [7]).
976: .PP
977: The
978: .I null
979: descriptor represents the null value.
980: A
981: .I "string qualifier"
982: represents a string, and contains the length of the string
983: and a pointer to the first character of the string.
984: An
985: .I "integer"
986: descriptor represents an integer small enough
987: to fit in the second word of the descriptor.
988: This includes all integers on computers whose C \fIints\fR are the same
989: size as C \fIlongs\fR (such as the VAX-11).
990: All data types other than \f3integer\fR, \f3string\fR, and \f3null\fR are
991: represented by
992: .I value
993: descriptors.
994: A value descriptor
995: contains a pointer to a
996: .I "data block"
997: of appropriate format for a value of the given type.
998: On computers whose C \fIlongs\fR are longer than C \fIints\fR
999: (such as the PDP-11),
1000: an integer that requires more bits than there are in an \fIint\fR is contained
1001: in a \fIlong integer\fR data block.
1002: The data block formats for each data type are shown in Appendix D.
1003: .PP
1004: A
1005: .I variable
1006: descriptor represents either a natural variable
1007: or a created variable.
1008: A natural variable contains a pointer to a descriptor at a fixed location
1009: (for a global identifier) or a location on the stack (for a local identifier)
1010: where the value of the variable is stored.
1011: A created variable, formed by a table, list, or field reference,
1012: contains a pointer to a descriptor in the aggregate
1013: where the referenced element is located.
1014: Since such elements are in the heap,
1015: created variables also contain an offset that indicates
1016: the distance (in words) from the beginning of the data block
1017: to the referenced descriptor.
1018: This offset is used during the marking phase of garbage collection,
1019: discussed in Section 3.3.
1020: .PP
1021: A
1022: .I "trapped variable"
1023: [8] descriptor represents a variable for which special action is necessary
1024: upon dereferencing or assignment.
1025: Such variables include substrings,
1026: non-existent elements of tables,
1027: and certain keywords.
1028: Each type of trapped variable is distinguished by the first word
1029: of the descriptor.
1030: .PP
1031: Substring trapped variables,
1032: created by a section or subscripting operation,
1033: contain a pointer to a data block that contains a
1034: .I variable
1035: descriptor identifying the value from which the substring was taken,
1036: an integer indicating the beginning position of the substring,
1037: and an integer showing the length of the substring.
1038: With this information, assignment to a substring of a variable
1039: can modify the contents of the variable properly.
1040: Substrings of non-variables do not produce
1041: substring trapped variables, since assignment to such substrings
1042: is meaningless and illegal;
1043: instead, forming the substring of a non-variable produces a
1044: string qualifier.
1045: .PP
1046: Table element trapped variables,
1047: formed by referencing a non-existent element of a table,
1048: similarly contain a pointer to a data block that contains
1049: enough information for assignment to add the element to the
1050: referenced table or to supply the default table value.
1051: .PP
1052: The keywords
1053: \fM&pos\fR,
1054: \fM&random\fR,
1055: and
1056: \fM&trace\fR
1057: are handled via trapped variables (\fM&subject\fR is handled differently).
1058: These trapped variables
1059: need no additional information.
1060: It is sufficient to know the type of trapped variable on dereferencing \(em
1061: the value of the keyword can be accessed and returned.
1062: On assignment, the new value is coerced to the appropriate type,
1063: checked for validity, and assigned to the keyword.
1064: .PP
1065: Strings formed during program execution are placed in the
1066: .I "string space" ;
1067: string qualifiers for these strings point into this region.
1068: Substrings of existing strings are not allocated again;
1069: instead, a string qualifier is formed that points into the
1070: existing string.
1071: When storage is exhausted in the string space,
1072: the garbage collector (see Section 3.3)
1073: is invoked to reclaim unused space and compact the region;
1074: if enough space cannot be reclaimed, the region is expanded if possible.
1075: .PP
1076: Data blocks formed during program execution are placed in the
1077: .I heap .
1078: Data blocks have a rigid format dictated by the garbage collection
1079: algorithm.
1080: The first word of the block always contains a type code
1081: which identifies the structure of the rest of the block.
1082: Descriptors follow all non-descriptor information in the block.
1083: If the size of the block is not determined by its type,
1084: the size (in bytes) is contained in the second word of the block.
1085: When storage is exhausted in the heap,
1086: the garbage collector is invoked to reclaim unused space and compact the heap;
1087: if enough space cannot be reclaimed, the heap is expanded if possible.
1088: .PP
1089: Co-expression stack blocks are allocated in a separate region and
1090: are treated specially by the garbage collector.
1091: .NH 2
1092: Stack Organization
1093: .PP
1094: The stack is the focus of activity
1095: during the execution of an Icon program.
1096: All operators, functions, and procedures
1097: expect to find their arguments at the top of the stack,
1098: and replace the arguments with the result of their computation.
1099: Local variables for Icon procedures are also kept on the stack.
1100: The arguments, local variables, and temporaries on the stack
1101: for an active Icon procedure are collectively called a
1102: .I "procedure frame" .
1103: This is one of several kinds of
1104: .I "stack frames"
1105: discussed in this section.
1106: Appendix E summarizes the layouts of the stack frames for the PDP-11 and VAX-11.
1107: See [3] for a detailed discussion of stack frames.
1108: Each co-expression also has a stack. For uniformity, the main stack
1109: is treated as the stack for the co-expression \fM&main\fR.
1110: .PP
1111: On the PDP-11 and VAX-11 stacks start in high memory and grow downward.
1112: On these computers, a push causes the stack pointer to decrease and a
1113: pop causes the stack pointer to increase. Thus ``above'' and ``below''
1114: refer, respectively, to ``older'' and ``newer'' information on the
1115: stack. An exception to this is the phrase
1116: ``top of the stack'', which is used to
1117: refer to the \fIlowest\fR memory location. The description of relative
1118: stack locations that follows is based on this kind of architecture
1119: and nomenclature.
1120: .PP
1121: Before an Icon procedure calls another Icon procedure,
1122: the caller pushes the procedure to be called
1123: (a descriptor \(em procedures are data objects in Icon)
1124: onto the stack.
1125: The caller then pushes each argument (also a descriptor) onto the stack,
1126: leftmost argument first.
1127: The caller then pushes one word onto the stack
1128: indicating the number of arguments supplied,
1129: which may be different from the number of arguments expected.
1130: The run-time library routine
1131: \fMinvoke\fR
1132: is then called, which checks that the first descriptor pushed above
1133: actually does represent an integer, procedure, or a variable whose value is
1134: an integer or a procedure.
1135: An integer indicates the selection of one of the
1136: arguments resulting from mutual evaluation.
1137: A procedure, on the other hand,
1138: points to a procedure data block,
1139: which contains various information about the called procedure,
1140: including the number of arguments expected,
1141: the number of local variables used,
1142: and the procedure's entry point address.
1143: The routine
1144: \fMinvoke\fR
1145: next adjusts the number of arguments supplied to match the number expected,
1146: deleting excess arguments or supplying the null value for missing ones.
1147: This adjustment is performed by moving the portion of the stack
1148: below the arguments up or down, as appropriate.
1149: It then dereferences the arguments.
1150: A
1151: .I "procedure marker"
1152: is then pushed onto the stack,
1153: and the
1154: .I "procedure frame pointer"
1155: is set to point to the new procedure marker.
1156: The procedure marker contains, among other things,
1157: the return address in the calling procedure
1158: and the previous value of the procedure frame pointer.
1159: Next, the null value is pushed onto the stack
1160: as the initial value for each local variable.
1161: The routine
1162: \fMinvoke\fR
1163: then transfers control to the procedure's entry point,
1164: and execution of the Icon program resumes in the new procedure.
1165: .PP
1166: When a procedure is ready to return to its caller,
1167: it pushes its return value (a descriptor) on the stack.
1168: It then transfers control to
1169: \fMpret\fR,
1170: which moves the return value
1171: to the location occupied by the descriptor that represented the
1172: called procedure.
1173: That is, the return value is stored in place of the
1174: first descriptor that was pushed
1175: at the beginning of the calling sequence described above.
1176: The return sequence
1177: then restores the state of the previous procedure
1178: from the current procedure marker
1179: (the procedure marker that the procedure frame pointer
1180: currently points to).
1181: This includes restoring the previous value of the procedure frame pointer,
1182: retrieving the return address,
1183: and popping the returning procedure's local variables,
1184: procedure marker, and arguments.
1185: Thus, when the calling procedure regains control,
1186: the arguments have been popped
1187: and the return value is now at the top of the stack.
1188: .PP
1189: Functions and operators are written in C,
1190: and therefore obey the C calling sequence.
1191: By design, the Icon calling sequence described above
1192: is similar to the C calling sequence.
1193: When an Icon procedure calls a function, a
1194: .I boundary
1195: on the stack is introduced,
1196: where the stack above the boundary is regimented by Icon standards,
1197: and the stack below the boundary contains C information.
1198: This boundary is important during garbage collection:
1199: The garbage collector must ignore the area of the stack
1200: below the boundary,
1201: since the structure of this area is unknown,
1202: whereas the structure of the area above the boundary is
1203: well-defined.
1204: In particular, all data above the boundary is contained
1205: in descriptors or is defined by the structure of a frame,
1206: so that all pointers into the heap or string space may be
1207: located during a garbage collection.
1208: .PP
1209: Functions and operators are written to \*(oqstraddle\*(cq
1210: the boundary.
1211: From above, they are designed to resemble Icon procedures;
1212: from below, they are C procedures.
1213: An Icon procedure calls a function in the same way as it
1214: calls another Icon procedure;
1215: in fact, functions are procedure-typed
1216: data objects just as Icon procedures are.
1217: When
1218: \fMinvoke\fR
1219: recognizes that a function is being called,
1220: it bypasses the argument adjustment
1221: if the field in the procedure data block that indicates
1222: the number of arguments expected contains \-1,
1223: which indicates that the function can take an arbitrary number of arguments.
1224: It also does not allocate stack space for local variables, since
1225: any such variables are C variables and are allocated by the C
1226: function itself.
1227: C procedures have an entry sequence that creates a new
1228: procedure frame;
1229: since
1230: \fMinvoke\fR
1231: has already done this,
1232: the entry point for functions must be set past any instructions
1233: that are involved in procedure frame creation.
1234: .PP
1235: The first formal parameter of
1236: all functions is
1237: \fMnargs\fR,
1238: which corresponds to the word that contains the number of
1239: arguments supplied.
1240: For functions that expect a fixed number of arguments,
1241: they are also listed as arguments, in reverse order.
1242: For functions that can take an arbitrary number of arguments,
1243: there is a macro
1244: \fMARG(\fIn\fM)\fR
1245: that uses the address and contents
1246: of
1247: \fMnargs\fR
1248: to calculate the location of the
1249: .I n \^th
1250: argument.
1251: Thus,
1252: \fMARG(1)\fR
1253: accesses the first argument (as a descriptor),
1254: and
1255: \fMARG(nargs)\fR
1256: accesses the last argument.
1257: Each function is responsible for supplying defaults for missing arguments
1258: and for dereferencing arguments that are variables.
1259: Because of the calling protocol,
1260: \fMARG(0)\fR
1261: accesses the location where the return value should be stored.
1262: Every function must place its result there and
1263: then return through normal C conventions.
1264: Each function must also supply a procedure data block that contains
1265: the number of arguments expected (or \-1), its entry point,
1266: and a string qualifier representing its name.
1267: .PP
1268: Operators are very similar to functions. The only difference is that
1269: operators are called directly (rather than being called through \fMinvoke\fR)
1270: and must set the boundary themselves.
1271: .ig
1272: .PP
1273: Operators are written like functions,
1274: with one exception.
1275: Since operators are not variables
1276: (as function and procedure names are),
1277: the name of the operator is known at translation time,
1278: and the Icon procedure calls it directly
1279: (at its normal entry point)
1280: without going through
1281: \fMinvoke\fR.
1282: Thus, operators do not need procedure data blocks.
1283: Instead of pushing a procedure descriptor pointing to a procedure block,
1284: a null value is pushed instead to hold a place for the return value.
1285: ..
1286: .PP
1287: When an operator or function fails to produce a result,
1288: it calls
1289: \fMfail\fR.
1290: This routine initiates backtracking as described below.
1291: .PP
1292: Expressions are evaluated within an
1293: .I "expression frame" .
1294: When the evaluation of an expression is complete,
1295: whether it has produced a result or failed,
1296: the expression frame must be popped from the stack
1297: and the result of the expression must be pushed back onto the stack.
1298: The expression frame marks the stack height at the point
1299: that the expression began to be evaluated,
1300: so that the stack may be restored to its original state
1301: when the evaluation of the expression is complete.
1302: The stack normally would be restored to the original height
1303: (that is, the pops would match the pushes)
1304: except when an expression fails at some midpoint in its evaluation.
1305: The expression frame is also used to limit backtracking:
1306: backtracking is restricted in the language
1307: to the current expression instance only.
1308: .PP
1309: When evaluation of an expression begins, an
1310: .I "expression marker"
1311: is pushed on the stack, the
1312: .I "expression frame pointer"
1313: is set to point to it, and the
1314: .I "generator frame pointer" ,
1315: discussed below, is cleared.
1316: The marker contains the previous values of the
1317: expression and generator frame pointers
1318: and a failure label.
1319: When an expression produces a result,
1320: that result, on the top of the stack,
1321: is popped and saved.
1322: Then the stack is popped to the expression marker,
1323: and the previous values of the two frame pointers are restored.
1324: The marker is popped
1325: and the result of the expression is pushed back onto the stack,
1326: now a part of the previous expression frame.
1327: If an expression fails to produce a result,
1328: \fMfail\fR
1329: pops the stack to the expression marker,
1330: restores the previous values of the two frame pointers,
1331: and branches to the failure label.
1332: In the special case that the failure label is zero,
1333: \fMfail\fR
1334: is effectively called again to indicate failure in the new expression frame.
1335: Thus the failure is propagated from one expression to an enclosing one.
1336: .PP
1337: If an expression has any generators,
1338: then there is a
1339: .I "generator frame"
1340: within the current expression frame
1341: for each generator that is inactive
1342: (that is, that has produced a value but is not yet exhausted).
1343: A generator frame preserves the state of the stack
1344: at the point just before the generator
1345: (whether it be operator, function, or procedure)
1346: suspended (became inactive).
1347: If
1348: \fMfail\fR
1349: is called and there are inactive generators,
1350: then instead of exiting the current expression frame,
1351: the most recently inactivated generator is reactivated
1352: by restoring the stack to the state saved in the most recent
1353: generator frame.
1354: .PP
1355: A function or operator suspends itself by calling
1356: \fMsuspend\fR.
1357: This routine preserves the state of the stack
1358: by duplicating the current expression frame,
1359: bounded on one end by the most recent generator frame
1360: (or, if there are no inactive generators, the current expression frame)
1361: and bounded on the other end by the beginning of the
1362: argument list of the suspending function or operator.
1363: A generator marker is pushed onto the stack,
1364: followed by the duplicate expression frame.
1365: The routine
1366: \fMsuspend\fR
1367: then causes the suspending function or operator to return to its caller,
1368: instead of itself returning.
1369: .PP
1370: When reactivated by
1371: \fMfail\fR,
1372: the stack is restored to the generator marker,
1373: which is used to restore the various frame pointers.
1374: Then the marker is popped.
1375: The stack is then in the same state that it was in when
1376: \fMsuspend\fR
1377: was called.
1378: The routine
1379: \fMfail\fR
1380: then returns to the generator
1381: as if the original call to
1382: \fMsuspend\fR
1383: had returned.
1384: Thus the following schema is typical of operators and functions
1385: that generate a sequence of values.
1386: .DS
1387: .ss 9
1388: .ft M
1389: \fIinitialize\fM;
1390: \fMwhile\fM (\fInot exhausted\fM) {
1391: \fIcompute next value\fM;
1392: \fIstore return value\fM;
1393: suspend()
1394: }
1395: fail();
1396: .ft R
1397: .ss 4
1398: .DE
1399: The effect of resuming an expression containing generators is that
1400: \fMsuspend\fR
1401: actually causes the generator to return.
1402: If alternatives are needed, backtracking occurs,
1403: and the apparent effect is that
1404: \fMsuspend\fR
1405: has finally returned.
1406: The generator computes the next result,
1407: and suspends with it.
1408: When the generator is exhausted,
1409: it merely fails without suspending,
1410: which just passes the failure back to the next most recently inactivated generator,
1411: if any.
1412: .PP
1413: Just as functions and operators can return normally, suspend, or fail,
1414: so can Icon procedures.
1415: The mechanics are essentially the same, but the differences in stack
1416: layout require different primitives.
1417: When Icon procedures return normally, the return value is presumed to
1418: be at the top of the stack and
1419: \fMpret\fR
1420: is called.
1421: Similarly, Icon procedures call
1422: \fMpsusp\fR
1423: to suspend.
1424: Both of these routines also dereference the return result
1425: if it is a local variable.
1426: The routine
1427: \fMpfail\fR
1428: causes an Icon procedure to return with no result.
1429: .PP
1430: The same three primitives are also needed at the expression level:
1431: \fMeret\fR,
1432: \fMesusp\fR,
1433: and
1434: \fMefail\fR.
1435: Unlike
1436: \fMunmark\fR,
1437: \fMeret\fR
1438: is not a library routine, but is generated as in-line code.
1439: Both cause an exit from the current expression frame; but
1440: \fMeret\fR
1441: supplies a result to the enclosing expression,
1442: while
1443: \fMunmark\fR
1444: does not.
1445: The routine
1446: \fMesusp\fR
1447: creates a inactive generator before supplying a result to the
1448: enclosing expression;
1449: it is used by the alternation control structure.
1450: The routine
1451: \fMefail\fR
1452: simply causes backtracking within the current expression frame.
1453: In fact,
1454: \fMfail\fR
1455: and
1456: \fMpfail\fR
1457: merely exit their procedure frame before branching to
1458: \fMefail\fR.
1459: .NH 2
1460: Storage Allocation and Reclamation
1461: .PP
1462: During program execution, storage allocation is necessary when a data
1463: object is created.
1464: The three primitive routines
1465: \fMallocate\fR,
1466: \fMalcstr\fR, and \fMalcestk\fR
1467: allocate storage in the heap, string space, and co-expression stack space, respectively.
1468: All three routines return pointers to the beginning of newly
1469: allocated space.
1470: None of the routines is responsible for ensuring that enough space
1471: remains in the data regions.
1472: Ensuring that enough space remains in the data regions is
1473: the responsibility of a
1474: .I "predictive need"
1475: strategy described below.
1476: .PP
1477: In the heap,
1478: \fMallocate(n)\fR
1479: returns a pointer to
1480: \fMn\fR
1481: contiguous bytes of storage.
1482: Because a wide variety of objects may reside in the heap, a number of
1483: support routines are provided to simplify the storing of various objects.
1484: There is a specific routine to allocate a block for each datatype in the heap.
1485: Where appropriate, these routines have the actual values to be stored
1486: as their arguments.
1487: All of the routines call
1488: \fMallocate\fR
1489: to obtain storage for the object.
1490: .PP
1491: In the string space,
1492: \fMalcstr(s,\*bl)\fR
1493: allocates room for a string of length
1494: \fMl\fR
1495: and copies the string pointed to by
1496: \fMs\fR
1497: into this space.
1498: Since some routines such as
1499: \fMleft\fR,
1500: \fMright\fR,
1501: and
1502: \fMcenter\fR
1503: need room in the string space in which to construct a string,
1504: a call to
1505: \fMalcstr\fR
1506: with the defined constant
1507: \fMNULL\fR
1508: as the first argument results in the
1509: allocation of storage without attempting to copy a string.
1510: .PP
1511: In the co-expression stack region, \fMalcestk()\fR allocates a new
1512: co-expression stack.
1513: .PP
1514: Source code for all of the allocation routines is contained in the file
1515: \fMrt/alc.c\fR.
1516: Almost all interaction with the storage management is made through these
1517: routines.
1518: Two exceptions occur in string concatenation (\fMoperators/cat.c\fR) and
1519: reading a fixed number of
1520: bytes from a file (\fMfunctions/reads.c\fR).
1521: In these cases, it is simpler and more efficient to have these operations
1522: deal directly with storage management.
1523: .PP
1524: As mentioned earlier, a
1525: .I "predictive need"
1526: strategy is employed to ensure that enough room remains for data storage.
1527: Simply put,
1528: .I "predictive need"
1529: states that it is the responsibility of any routine
1530: that calls an allocation routine both to ensure that enough room remains
1531: in the proper data region and to maintain the validity of any temporary
1532: pointers into the data regions, should a
1533: .I "garbage collection"
1534: be necessary to free storage space.
1535: .PP
1536: Since the check for storage space only needs to occur before the allocation
1537: takes place, each routine may perform this check at its convenience.
1538: This approach permits the minimization of the number of temporary
1539: pointers that must be protected during garbage collection.
1540: As an aid, space for several descriptors is automatically protected
1541: by the procedure invocation mechanism, and usually is used to hold
1542: information pertaining to the arguments of the procedure (see Section 3.4).
1543: .PP
1544: Routines to ensure space are provided for each of the three
1545: storage regions.
1546: The routine
1547: \fMsneed(n)\fR
1548: ensures that at least
1549: \fMn\fR
1550: bytes of storage remain in the string space, and
1551: \fMhneed(n)\fR
1552: performs the same function in the heap.
1553: \fMesneed()\fR ensures that there is a co-expression stack
1554: available.
1555: If either routine finds that there is insufficient storage remaining,
1556: it invokes the
1557: .I "garbage collector"
1558: in an attempt to obtain that storage.
1559: If that fails, then program execution is aborted with an appropriate
1560: diagnostic.
1561: .PP
1562: Garbage collection, or
1563: .I "storage reclamation" ,
1564: is a process that identifies all valid allocated data.
1565: In the string and heap regions, valid data is compacted
1566: in order to provide a contiguous area of unused storage.
1567: The algorithm used for identifying valid data is based upon
1568: the algorithm described by Hanson [7].
1569: Only the more novel features are discussed here.
1570: .PP
1571: Whenever a predictive need request discovers that insufficient
1572: storage remains in either the heap or string space, the garbage
1573: collector is invoked to reclaim unused space in all regions.
1574: This approach is more efficient in situations where all regions
1575: are heavily allocated and only slightly less efficient otherwise.
1576: .PP
1577: The approach is to sweep through the permanent data regions and the
1578: stack, looking for descriptors that are either pointers into the heap or
1579: string qualifiers.
1580: When a string qualifier is found, a pointer to that qualifier is
1581: saved in a temporary data region at the end of the heap.
1582: If the descriptor is a pointer into the heap, then that heap data
1583: block contains valid information.
1584: The block is marked as valid, the descriptor is placed on a back chain
1585: headed in the block,
1586: and the marking process is called
1587: recursively on any descriptors within that block.
1588: Blocks that are already marked as valid are not processed a second time.
1589: To simplify the marking of heap blocks, all data blocks have been
1590: designed so that all descriptors within them exist as a contiguous
1591: section at the end of the block.
1592: Thus to sweep through the descriptors within a block, the marking
1593: algorithm need only know the size of the block and the
1594: location of the first descriptor.
1595: Information concerning a data block's size, as well as the
1596: offset for the first descriptor is in the file
1597: \fMrt/dblocks.c\fR.
1598: .PP
1599: Valid co-expression stacks also may contain string qualifiers and
1600: pointers to other valid data; such stacks are included in the
1601: marking phase.
1602: .PP
1603: After the marking phase is completed, the string region is compacted.
1604: The algorithm used is described by Hanson [9].
1605: The pointers to the string qualifiers are sorted so that the order of
1606: all valid strings within the string space is identified.
1607: The string qualifiers are then processed in order, and modified as
1608: the valid strings are compacted.
1609: If this compaction does not free enough space within the
1610: string space to satisfy the request, the heap must be moved in order
1611: to provide more room in the string space.
1612: An attempt is also made to provide
1613: some additional
1614: ``breathing room''
1615: in the string space to permit future expansion.
1616: .PP
1617: The heap cannot be moved until after the valid pointers into it are
1618: adjusted and the storage is compacted.
1619: The pointer adjustment and heap compaction phases are two linear passes through
1620: the heap which must be performed during standard heap garbage collection.
1621: The only difference when the heap is to be moved is that the adjusted
1622: pointers point to where that data will be after the heap has been moved.
1623: If not enough breathing room is freed in the heap, then
1624: more space is requested from the operating system.
1625: As a last step, if the string space needs more room, the heap is
1626: relocated.
1627: .PP
1628: This method has proved to be quite satisfactory for most applications.
1629: A shortcoming of the implementation is the absence of a process for
1630: decreasing the size of a data region, should it become too large.
1631: It is also possible that insufficient room would be available for storing
1632: the pointers to the string qualifiers, even though enough storage
1633: would become available if the heap were collected separately.
1634: In practice, this has not been a problem.
1635: The source code for the garbage collector is contained in the files
1636: \fMrt/gcollect.s\fR, \fMrt/gc.c\fR, and \fMrt/sweep.c\fR.
1637: .NH 2
1638: Coding Conventions
1639: .PP
1640: The calling conventions for functions and operators have been mentioned
1641: earlier.
1642: Several other aspects of the run-time system are explained here.
1643: .PP
1644: All header files for the run-time system are in the directory
1645: \fMh\fR.
1646: The file
1647: \fMh/rt.h\fR
1648: (or, for assembly-language routines,
1649: \fMh/defs.s\fR)
1650: is included by almost every source file in the run-time system, and contains
1651: machine-dependent defined constants, run-time data structure declarations,
1652: and defined constants and macros for flags,
1653: type codes, argument accessing, and bit manipulations.
1654: .PP
1655: During the execution of an Icon program, many type
1656: conversions are done on temporary values, where data storage is not
1657: required beyond the bounds of the current operation.
1658: For this reason, the type conversion routines all operate with pointers passed
1659: to them that reference
1660: buffers in the calling procedure.
1661: Any routine calling for type conversion must determine if
1662: heap or string space storage is needed, and perform the
1663: allocation.
1664: Most of the conversion routines return the type of the result or
1665: \fMNULL\fR
1666: if the conversion cannot be performed.
1667: One exception is
1668: \fMcvstr\fR
1669: which, in addition to
1670: \fMNULL\fR,
1671: returns
1672: \fM2\fR
1673: if the object was already a string, and
1674: \fM1\fR
1675: if the object
1676: had to be converted to a string.
1677: This distinction makes it possible to avoid a large number of
1678: predictive-need checks.
1679: The second exception is
1680: \fMcvnum\fR,
1681: which returns either
1682: real numbers or integers
1683: and makes no attempt to distinguish between
1684: short and long integers.
1685: .PP
1686: As mentioned in Section 3.3, there is space set aside to hold
1687: temporary descriptors and to protect the validity of these
1688: descriptors during garbage collection.
1689: The garbage collector knows about this region, and
1690: .I tends
1691: it during storage reclamation.
1692: The region is defined in the file
1693: \fMh/gc.h\fR,
1694: and is bounded by the labels
1695: \fMtended\fR
1696: and
1697: \fMetended\fR.
1698: This area can be referenced from C by considering
1699: \fMtended\fR
1700: to be an array of descriptors.
1701: Since a garbage collection can occur
1702: only during a call to
1703: \fMsneed\fR
1704: or
1705: \fMhneed\fR,
1706: or between suspension and reactivation,
1707: the only places where C routines need to ensure that all pointers
1708: into the heap or string space are tended
1709: are just before calls to
1710: \fMsneed\fR,
1711: \fMhneed\fR,
1712: or
1713: \fMsuspend\fR.
1714: .PP
1715: All function names are preceded by the letter
1716: \fMX\fR,
1717: and their procedure blocks are preceded by the letter
1718: \fMB\fR.
1719: This prevents name collisions between Icon procedures and other routines,
1720: such as those for operators, type conversions, and storage management.
1721: Reference from the generated code to functions is made entirely through
1722: the procedure block;
1723: the entry point field of the procedure block references the function itself.
1724: .NH
1725: Modifying the Implementation
1726: .PP
1727: This section is intended to serve as a brief guide
1728: for those who wish to modify the Icon system.
1729: It is not comprehensive;
1730: it only points to various parts of the
1731: implementation that need to be considered
1732: when making various kinds of changes.
1733: .PP
1734: Perhaps the most common kind of change that one might expect to make
1735: is to add new functions (built-in procedures).
1736: To add a function, first write it according to the conventions
1737: described in Section 3.4.
1738: (Use an existing function similar to the new one as a prototype.
1739: Appendix F contains several example functions.) Be
1740: especially careful to observe the rules
1741: concerning storage allocation and tended descriptors.
1742: .PP
1743: .c note PI here
1744: Prepare to add the new function to the run-time library
1745: by moving the source code into the
1746: \fMfunctions\fR
1747: directory and adding its name to
1748: \fMfunctions/Makefile\fR
1749: (the name must be added in three places \(em
1750: there are many examples already in the \fMMakefile\fR).
1751: Then add the name to \fMh/pdef.h\fR
1752: in proper alphabetical order.
1753: Use other functions as a guide to the format of the entry.
1754: .PP
1755: When all changes have been made to the source code,
1756: go to the Icon root directory and run
1757: .Ds
1758: make Icon
1759: .De
1760: This runs
1761: \fMmake\fR
1762: in each of the system directories \(em
1763: \fMtran\fR,
1764: \fMlink\fR,
1765: \fMfunctions\fR,
1766: \fMoperators\fR,
1767: \fMlib\fR,
1768: \fMrt\fR, and \fMiconx\fR
1769: \(em and then copies the new versions into the
1770: \fMbin\fR
1771: directory [10].
1772: .PP
1773: Adding a new operator is more complicated.
1774: Again, the first step is to write the routine, place it in the
1775: \fMoperators\fR
1776: directory, and add the appropriate entry to the
1777: \fMMakefile\fR
1778: there.
1779: Next, the operator must be added to the translator, as follows:
1780: .IP (1) \w'(1)'u+1n
1781: Add the operator to the operator table in
1782: \fMtran/optab\fR;
1783: the structure of the table is described in Section 1.1.
1784: .IP (2)
1785: Create a unique name for the new token
1786: and make a new token table entry in
1787: \fMtran/tokens\fR
1788: in the operators section of the table.
1789: Although the operators section of the table is in alphabetical order
1790: by token name
1791: as distributed,
1792: there is no need to preserve this order.
1793: .IP (3)
1794: If a running Version 5 of Icon is not available,
1795: edit the files
1796: \fMtran/optab.c\fR
1797: and
1798: \fMtran/toktab.c\fR
1799: to correspond to the changes made in steps 1 and 2.
1800: This sometimes involves a renumbering of token table entries
1801: in both files (but nowhere else).
1802: If a running Version 5 of Icon is available,
1803: a
1804: \fMmake\fR
1805: in
1806: \fMtran\fR
1807: executes
1808: \fMmktoktab\fR
1809: to produce the new token tables.
1810: .IP (4)
1811: Add the operator to the grammar in
1812: \fMtran/icon.g\fR.
1813: The token name must be added to the list of terminal symbols
1814: at the beginning of the grammar file,
1815: and the operator must be inserted into the syntax at the
1816: appropriate precedence level.
1817: If the precedence is the same as that of an existing operator,
1818: simply add the operator as an alternative to the existing production;
1819: otherwise, insert a new production,
1820: and change the production at the next lower precedence level
1821: to refer to the new one.
1822: The semantic action should create either a
1823: .I BINOP
1824: or a
1825: .I UNOP
1826: node in the parse tree;
1827: use existing actions as a prototype.
1828: .IP (5)
1829: The new operator must now be added to the code generator in
1830: \fMtran/code.c\fR.
1831: Insert a case in either of the routines
1832: \fMbinop\fR
1833: or
1834: \fMunop\fR
1835: for the new token name
1836: that assigns a new intermediate code opcode to
1837: \fMname\fR,
1838: as for other operators \(em
1839: this causes the new opcode to be emitted into the ucode.
1840: The opcode should have the same name as the library routine
1841: that performs the operation.
1842: .IP (6)
1843: The new intermediate code opcode must also be added to the linker.
1844: Add a defined constant to
1845: \fMlink/opcode.h\fR;
1846: order here is not important.
1847: Then add the opcode name and the defined constant to
1848: \fMlink/opcode.c\fR;
1849: alphabetical order must be preserved here,
1850: since a binary search is used.
1851: Then edit the code generator in
1852: \fMlink/lcode.c\fR,
1853: adding a case in the routine
1854: \fMgencode\fR
1855: with either the binary or the unary operators.
1856: The standard processing here emits code that
1857: evaluates the operand(s),
1858: then calls a library routine
1859: with the same name as the intermediate code opcode.
1860: .IP (7)
1861: Add entries for the operator to \fMh/pnames.h\fR, using other operators as a guide.
1862: .LP
1863: The system is then be ready to be made as described above.
1864: .PP
1865: Adding a new control structure is similar in nature to
1866: adding a new operator.
1867: Most often, a new reserved word must be added to
1868: \fMtran/tokens\fR;
1869: this part of the token table must be kept in alphabetical order.
1870: The new token must be added to the grammar,
1871: and productions must be added,
1872: usually at the highest precedence level
1873: (the same as
1874: \fMif\fR,
1875: for example).
1876: The semantic action for the new production probably will
1877: involve creating a parse tree node of a new type.
1878: The new node type should be added to
1879: \fMtran/tree.h\fR
1880: and a new case in the routine
1881: \fMtraverse\fR
1882: (in
1883: \fMtran/code.c\fR)
1884: should be added to generate intermediate code.
1885: The intermediate code generated can use any of the existing
1886: opcodes or can use new ones created specifically for the new control structure.
1887: If new opcodes are created,
1888: they must be added to the linker as described above,
1889: and a new case in the routine
1890: \fMgencode\fR
1891: must generate code for it.
1892: The generated code can be either entirely in-line
1893: or can call a new library routine.
1894: If new code generation templates are needed,
1895: modify the code emission routines
1896: in
1897: \fMlink/lcode.c\fR.
1898: If the code calls a new library routine,
1899: add it to the
1900: \fMlib\fR
1901: directory and the
1902: \fMMakefile\fR
1903: there.
1904: Then the system is ready to be made.
1905: .PP
1906: Modifying the semantics of existing control structures,
1907: operators, or functions,
1908: often involves changing only the generated in-line code
1909: or a library routine.
1910: Modifying the syntax without disturbing any semantics
1911: usually requires only a change to the grammar.
1912: .PP
1913: Adding a new datatype means making many of the above changes.
1914: A new datatype code must be added to
1915: \fMh/rt.h\fR
1916: and
1917: \fMh/defs.s\fR,
1918: and a new data block format must be defined in \fMh/rt.h\fR.
1919: The size and location of the first descriptor of the new data block
1920: must be entered in
1921: \fMrt/dblocks.c\fR
1922: so that the garbage collector knows how to treat the block.
1923: The routines in
1924: \fMfunctions/image.c\fR
1925: and
1926: \fMrt/outimage.c\fR
1927: must be extended so that images of the new datatype can be produced.
1928: The routines in \fMfunctions/type.c\fR and \fMrt/equiv.c\fR must
1929: also be modified to account for the new type.
1930: In addition, \fMrt/anycmp.c\fR must be extended so that objects
1931: of the new type can be sorted relative to other types.
1932: New functions and operators on the new type may be needed,
1933: and possibly new coercion routines must be added to
1934: \fMrt\fR.
1935: .PP
1936: Adding a new keyword entails a change to
1937: \fMtran/keywords\fR
1938: (and, if a running Version 5 of Icon is not available, to
1939: \fMtran/keyword.h\fR)
1940: and a new case in
1941: \fMlib/keyword.c\fR.
1942: A
1943: \fMmake\fR
1944: in
1945: \fMtran\fR
1946: runs the program
1947: \fMmkkeytab\fR
1948: to produce both
1949: \fMtran/keyword.h\fR
1950: and
1951: \fMtran/keyword.c\fR.
1952: Many keywords require trapped variables,
1953: which requires changes to
1954: \fMh/rt.h\fR,
1955: \fMoperators/asgn.c\fR,
1956: and
1957: \fMrt/deref.c\fR;
1958: the trapped variable for
1959: \fM&random\fR
1960: is a good model.
1961: .PP
1962: As mentioned above,
1963: the examples in this section are intended
1964: to identify what parts of the system are affected
1965: by certain kinds of changes or extensions.
1966: A thorough understanding of the system is
1967: suggested, however, for other than minor changes.
1968: .bp
1969: .SH
1970: Acknowledgements
1971: .PP
1972: This report is a revision of earlier descriptions of the C
1973: implementation of Icon [11, 12], which were co-authored by Cary Coutant
1974: and Steve Wampler.
1975: Much of the material in this report is taken from the earlier ones.
1976: .PP
1977: Many features of the current implementation of Icon are based
1978: upon the original Ratfor implementation by
1979: Dave Hanson, Tim Korb, and Walt Hansen [13, 14].
1980: .sp 1
1981: .SH
1982: References
1983: .IP 1. 5n
1984: Griswold, Ralph E. and Madge T. Griswold.
1985: .I "The Icon Programming Language" .
1986: Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983.
1987: .IP 2. 5n
1988: Kernighan, Brian W., and Dennis M. Ritchie.
1989: .I "The C Programming Language" .
1990: Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1978.
1991: .IP 3. 5n
1992: Mitchell, William W. \fIPorting the UNIX Implementation of Icon\fR.
1993: Technical Report TR 83-10d, Department of Computer Science, The
1994: University of Arizona, Tucson, Arizona, August 1984.
1995: .IP 4. 5n
1996: Griswold, Ralph E., Robert K. McConeghy, and William H. Mitchell. \fIVersion 5.9 of Icon\fR.
1997: Technical Report, Department of Computer Science, The University of
1998: Arizona, Tucson, Arizona, July 1984.
1999: .IP 5. 5n
2000: Johnson, Stephen C.
2001: ``Yacc: Yet Another Compiler-Compiler''
2002: .I "Unix Programmer's Manual, Seventh Edition" .
2003: Bell Telephone Laboratories, Inc., Murray Hill, New Jersey, 1979.
2004: .IP 6. 5n
2005: Aho, A. V., and S. C. Johnson.
2006: ``LR Parsing''"
2007: .I "Computing Surveys"
2008: .B 6 ,
2009: 2 (June 1974), 99-124.
2010: .IP 7. 5n
2011: Hanson, David R.
2012: ``Storage Management for an Implementation of SNOBOL4'',
2013: .I "Software\(emPractice and Experience"
2014: .B 7 ,
2015: 2 (March 1977), 179-192.
2016: .IP 8. 5n
2017: Hanson, David R.
2018: ``Variable Associations in SNOBOL4'',
2019: .I "Software\(emPractice and Experience"
2020: .B 6 ,
2021: 2 (April 1976), 245-254.
2022: .IP 9. 5n
2023: Hanson, David R.
2024: .I "The Manipulation of Variable-Length String Data in Fortran IV" .
2025: Technical Report,
2026: Department of Computer Science,
2027: The University of Arizona, Tucson, Arizona,
2028: May 1975.
2029: .IP 10. 5n
2030: Griswold, Ralph E. and William H. Mitchell. \fIInstallation and Maintenance
2031: Guide for Version 5.9 of Icon\fR. Technical Report TR 84-13,
2032: Department of Computer Science,
2033: The University
2034: of Arizona,
2035: Tucson, Arizona. August 1984.
2036: .IP 11. 5n
2037: Coutant, Cary A. and Stephen B. Wampler. \fIA Tour Through the
2038: C Implementation of Icon; Version 5\fR. Technical Report TR 81-11a,
2039: Department of Computer Science, The University of Arizona, Tucson,
2040: Arizona, December 1981.
2041: .IP 12. 5n
2042: Griswold, Ralph E., William H. Mitchell, and Stephen B. Wampler.
2043: \fIThe C Implementation of Icon; A Tour Through Version 5\fR.
2044: Technical Report TR 83-11a, Department of Computer Science,
2045: The University of Arizona, Tucson, Arizona. December 1983.
2046: .IP 13. 5n
2047: Korb, John Timothy.
2048: .I "The Design and Implementation of a Goal-Directed Programming Language" .
2049: Ph.D. Dissertation, Technical Report TR 79-11,
2050: Department of Computer Science,
2051: The University of Arizona, Tucson, Arizona,
2052: June 1979.
2053: .IP 14. 5n
2054: Hanson, David R., and Walter J. Hansen.
2055: .I "Icon Implementation Notes" .
2056: Technical Report TR 79-12a,
2057: Department of Computer Science,
2058: The University of Arizona, Tucson, Arizona,
2059: February 1980.
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.