Annotation of 43BSDReno/share/doc/smm/19.porttour/porttour1.ms, revision 1.1

1.1     ! root        1: .\"    @(#)porttour1.ms        6.3 (Berkeley) 5/30/86
        !             2: .\"
        !             3: .ds f \fIf77\fP
        !             4: .ds U \s-2UNIX\s0
        !             5: .ds V \s-2VAX\s0
        !             6: .de Bl
        !             7: .na
        !             8: .fi
        !             9: .ll 3i
        !            10: ..
        !            11: .de Sp
        !            12: .sp 0.25v
        !            13: ..
        !            14: .TL
        !            15: A Tour Through the Portable C Compiler
        !            16: .AU
        !            17: S. C. Johnson
        !            18: .AI
        !            19: AT&T Bell Laboratories
        !            20: .sp
        !            21: .AU
        !            22: Donn Seeley
        !            23: .AI
        !            24: Department of Computer Science
        !            25: University of Utah
        !            26: .AB
        !            27: .OH 'A Tour Through the Portable C Compiler''SMM:19-%'
        !            28: .EH 'SMM:19-%''A Tour Through the Portable C Compiler'
        !            29: .PP
        !            30: Since its introduction,
        !            31: the Portable C Compiler has become the standard
        !            32: \*U C compiler for many machines.
        !            33: Three quarters or more of the code
        !            34: in the compiler is machine independent
        !            35: and much of the rest can be generated easily
        !            36: using knowledge of the target architecture.
        !            37: This paper describes the structure
        !            38: and organization of the compiler
        !            39: and tries to further simplify the job
        !            40: of the compiler porter.
        !            41: .PP
        !            42: This document originally appeared with the Seventh Edition of \*U,
        !            43: and has been revised and extended for publication
        !            44: with the Fourth Berkeley Software Distribution.
        !            45: The new material covers changes which
        !            46: have been made in the compiler since the Seventh Edition,
        !            47: and includes some discussion of secondary topics
        !            48: which were thought to be of interest
        !            49: in future ports of the compiler.
        !            50: .sp
        !            51: .LP
        !            52: Revised April, 1986
        !            53: .AE
        !            54: .SH
        !            55: Introduction
        !            56: .PP
        !            57: A C compiler has been implemented that has proved to be quite portable,
        !            58: serving as the basis for C compilers on roughly a dozen machines, including
        !            59: the DEC \*V, Honeywell 6000, IBM 370, and Interdata 8/32.
        !            60: The compiler is highly compatible with the C language standard.
        !            61: .[
        !            62: Kernighan Ritchie Prentice 1978
        !            63: .]
        !            64: .PP
        !            65: Among the goals of this compiler are portability, high reliability,
        !            66: and the use of state-of-the-art techniques and tools wherever practical.
        !            67: Although the efficiency of the compiling process is not
        !            68: a primary goal, the compiler is efficient enough, and produces
        !            69: good enough code, to serve as a production compiler.
        !            70: .PP
        !            71: The language implemented is highly compatible with the current PDP-11 version of C.
        !            72: Moreover, roughly 75% of the compiler, including
        !            73: nearly all the syntactic and semantic routines, is machine independent.
        !            74: The compiler also serves as the major portion of the program
        !            75: .I lint ,
        !            76: described elsewhere.
        !            77: .[
        !            78: Johnson Lint Program Checker
        !            79: .]
        !            80: .PP
        !            81: A number of earlier attempts to make portable compilers are worth noting.
        !            82: While on CO-OP assignment to Bell Labs in 1973, Alan Snyder wrote a portable C
        !            83: compiler which was the basis of his Master's Thesis at M.I.T.
        !            84: .[
        !            85: Snyder Portable C Compiler
        !            86: .]
        !            87: This compiler was very slow and complicated, and contained a number of rather
        !            88: serious implementation difficulties;
        !            89: nevertheless, a number of Snyder's ideas appear in this work.
        !            90: .PP
        !            91: Most earlier portable compilers, including Snyder's, have proceeded by
        !            92: defining an intermediate language, perhaps based
        !            93: on three-address code or code for a stack machine,
        !            94: and writing a machine independent program to
        !            95: translate from the source code to this
        !            96: intermediate code.
        !            97: The intermediate code is then read by a second pass, and interpreted or compiled.
        !            98: This approach is elegant, and has a number of advantages, especially if the
        !            99: target machine is far removed from the host.
        !           100: It suffers from some disadvantages as well.
        !           101: Some constructions, like initialization and subroutine prologs,
        !           102: are difficult or expensive to express in a
        !           103: machine independent way that still allows them
        !           104: to be easily adapted to the target assemblers.
        !           105: Most of these approaches require a symbol table to be constructed
        !           106: in the second (machine dependent) pass, and/or
        !           107: require powerful target assemblers.
        !           108: Also, many conversion operators may be generated
        !           109: that have no effect on a given machine,
        !           110: but may be needed on others (for example, pointer to pointer
        !           111: conversions usually do nothing in C, but must be generated because
        !           112: there are some machines where they are significant).
        !           113: .PP
        !           114: For these reasons, the first pass of the portable compiler
        !           115: is not entirely machine independent.
        !           116: It contains some machine dependent features, such as initialization, subroutine
        !           117: prolog and epilog, certain storage allocation functions,
        !           118: code for the
        !           119: .I switch
        !           120: statement, and code to throw out unneeded conversion operators.
        !           121: .PP
        !           122: As a crude measure of the degree of
        !           123: portability actually achieved,
        !           124: the Interdata 8/32 C compiler has
        !           125: roughly 600 machine dependent lines of source out of 4600 in Pass 1, and 1000
        !           126: out of 3400 in Pass 2.
        !           127: In total, 1600 out of 8000, or 20%,
        !           128: of the total source is machine dependent
        !           129: (12% in Pass 1, 30% in Pass 2).
        !           130: These percentages can be expected to rise slightly as the
        !           131: compiler is tuned.
        !           132: The percentage of machine-dependent code for the IBM is 22%, for
        !           133: the Honeywell 25%.
        !           134: If the assembler format and structure were the same for all
        !           135: these machines, perhaps another 5-10% of the code
        !           136: would become machine independent.
        !           137: .PP
        !           138: These figures are sufficiently misleading as to be almost
        !           139: meaningless.
        !           140: A large fraction of the machine dependent code can be converted
        !           141: in a straightforward, almost mechanical way.
        !           142: On the other hand, a certain amount of the code requires hard
        !           143: intellectual effort to convert, since the algorithms
        !           144: embodied in this part of the code are typically complicated
        !           145: and machine dependent.
        !           146: .PP
        !           147: To summarize, however, if you need a C compiler written for a machine with
        !           148: a reasonable architecture, the compiler is already three quarters finished!
        !           149: .SH
        !           150: Overview
        !           151: .PP
        !           152: This paper discusses the structure and organization of the portable compiler.
        !           153: The intent is to give the big picture, rather than discussing the details of a particular machine
        !           154: implementation.
        !           155: After a brief overview and a discussion of the source file structure,
        !           156: the paper describes the major data structures, and then delves more
        !           157: closely into the two passes.
        !           158: Some of the theoretical work on which the compiler is based, and
        !           159: its application to the compiler, is discussed elsewhere.
        !           160: .[
        !           161: johnson portable theory practice
        !           162: .]
        !           163: One of the major design issues in any C compiler, the design of
        !           164: the calling sequence and stack frame, is the subject of a separate
        !           165: memorandum.
        !           166: .[
        !           167: johnson lesk ritchie calling sequence
        !           168: .]
        !           169: .PP
        !           170: The compiler consists of two passes,
        !           171: .I pass1
        !           172: and
        !           173: .I pass2 ,
        !           174: that together turn C source code into assembler code for the target
        !           175: machine.
        !           176: The two passes are preceded by a preprocessor, that
        !           177: handles the
        !           178: .B "#define"
        !           179: and
        !           180: .B "#include"
        !           181: statements, and related features (e.g.,
        !           182: .B #ifdef ,
        !           183: etc.).
        !           184: The two passes may optionally be followed by
        !           185: a machine dependent code improver.
        !           186: .PP
        !           187: The output of the preprocessor is a text file that is read as the standard
        !           188: input of the first pass.
        !           189: This
        !           190: produces as standard output another text file that becomes the standard
        !           191: input of the second pass.
        !           192: The second pass produces, as standard output, the desired assembler language source code.
        !           193: The code improver, if used, converts the assembler code
        !           194: to more effective code,
        !           195: and the result is passed to the assembler.
        !           196: The preprocessor and the two passes
        !           197: all write error messages on the standard error file.
        !           198: Thus the compiler itself makes few demands on the I/O
        !           199: library support, aiding in the bootstrapping process.
        !           200: .PP
        !           201: The division of the compiler into two passes is somewhat artificial.
        !           202: The compiler can optionally be loaded
        !           203: so that both passes operate in the same program.
        !           204: This ``one pass'' operation eliminates the overhead of
        !           205: reading and writing the intermediate file, so the compiler
        !           206: operates about 30% faster in this mode.
        !           207: It also occupies about 30% more space than the larger
        !           208: of the two component passes.
        !           209: This ``one pass'' compiler is the standard version
        !           210: on machines with large address spaces, such as the \*V.
        !           211: .PP
        !           212: Because the compiler is fundamentally structured as two
        !           213: passes, even when loaded as one, this document primarily
        !           214: describes the two pass version.
        !           215: .PP
        !           216: The first pass does the lexical analysis, parsing, and
        !           217: symbol table maintenance.
        !           218: It also constructs parse trees for expressions,
        !           219: and keeps track of the types of the nodes in these trees.
        !           220: Additional code is devoted to initialization.
        !           221: Machine dependent portions of the first pass serve to
        !           222: generate subroutine prologs and epilogs,
        !           223: code for switches, and code for branches, label definitions,
        !           224: alignment operations,
        !           225: changes of location counter, etc.
        !           226: .PP
        !           227: The intermediate file is a text file organized into lines.
        !           228: Lines beginning with a right parenthesis are copied by the second
        !           229: pass directly to its output file, with the
        !           230: parenthesis stripped off.
        !           231: Thus, when the first pass produces assembly code, such as subroutine prologs,
        !           232: etc., each line is prefaced with a right parenthesis;
        !           233: the second pass passes these lines to through to the assembler.
        !           234: .PP
        !           235: The major job done by the second pass is generation of code
        !           236: for expressions.
        !           237: The expression parse trees produced in the first pass are
        !           238: written onto the
        !           239: intermediate file in Polish Prefix form:
        !           240: first, there is a line beginning with a period, followed by the source
        !           241: file line number and name on which the expression appeared (for debugging purposes).
        !           242: The successive lines represent the nodes of the parse tree,
        !           243: one node per line.
        !           244: Each line contains the node number, type, and
        !           245: any values (e.g., values of constants)
        !           246: that may appear in the node.
        !           247: Lines representing nodes with descendants are immediately
        !           248: followed by the left subtree of descendants, then the
        !           249: right.
        !           250: Since the number of descendants of any node is completely
        !           251: determined by the node number, there is no need to mark the end
        !           252: of the tree.
        !           253: .PP
        !           254: There are only two other line types in the intermediate file.
        !           255: Lines beginning with a left square bracket (`[') represent the beginning of
        !           256: blocks (delimited by { ... } in the C source);
        !           257: lines beginning with right square brackets (`]') represent the end of blocks.
        !           258: The remainder of these lines tell how much
        !           259: stack space, and how many register variables,
        !           260: are currently in use.
        !           261: .PP
        !           262: Thus, the second pass reads the intermediate files, copies the `)' lines,
        !           263: makes note of the information in the `[' and `]' lines,
        !           264: and devotes most of its effort to the
        !           265: `.' lines and their associated expression trees, turning them
        !           266: turns into assembly code to evaluate the expressions.
        !           267: .PP
        !           268: In the one pass version of the compiler, the expression
        !           269: trees contain information useful to both logical passes.
        !           270: Instead of writing the trees onto an intermediate file,
        !           271: each tree is transformed in place into an acceptable
        !           272: form for the code generator.
        !           273: The code generator then writes the result of compiling
        !           274: this tree onto the standard output.
        !           275: Instead of `[' and `]' lines in the intermediate file, the information is passed
        !           276: directly to the second pass routines.
        !           277: Assembly code produced by the first pass is simply written out,
        !           278: without the need for `)' at the head of each line.
        !           279: .SH
        !           280: The Source Files
        !           281: .PP
        !           282: The compiler source consists of 25 source files.
        !           283: Several header files contain information
        !           284: which is needed across various source modules.
        !           285: .I Manifest.h
        !           286: has declarations for node types, type manipulation
        !           287: macros and other macros, and some global data definitions.
        !           288: .I Macdefs.h
        !           289: has machine-dependent definitions, such as the size and alignment of the
        !           290: various data representations.
        !           291: .I Config.h
        !           292: defines symbols which control the configuration of the compiler,
        !           293: including such things as the sizes of various tables
        !           294: and whether the compiler is ``one pass''.
        !           295: The compiler conditionally includes another file,
        !           296: .I onepass.h ,
        !           297: which contains definitions which are particular to
        !           298: a ``one pass'' compiler.
        !           299: .I Ndu.h
        !           300: defines the basic tree building structure
        !           301: which is used throughout the compiler
        !           302: to construct expression trees.
        !           303: .I Manifest.h
        !           304: includes a file of opcode and type definitions named
        !           305: .I pcclocal.h ;
        !           306: this file is automatically generated from
        !           307: a header file specific to the C compiler named
        !           308: .I localdefs.h
        !           309: and a public header file
        !           310: .I /usr/include/pcc.h .
        !           311: Another file,
        !           312: .I pcctokens ,
        !           313: is generated in a similar way
        !           314: and contains token definitions for the compiler's Yacc
        !           315: .[
        !           316: Johnson Yacc Compiler cstr
        !           317: .]
        !           318: grammar.
        !           319: Two machine independent header files,
        !           320: .I pass1.h
        !           321: and
        !           322: .I pass2.h ,
        !           323: contain the data structure and manifest definitions
        !           324: for the first and second passes, respectively.
        !           325: In the second pass, a machine dependent header file,
        !           326: .I mac2defs.h ,
        !           327: contains declarations of register names, etc.
        !           328: .PP
        !           329: .I Common.c
        !           330: contains machine independent routines used in both passes.
        !           331: These include routines for allocating and freeing trees, walking over trees,
        !           332: printing debugging information, and printing error messages.
        !           333: This file can be compiled in two flavors,
        !           334: one for pass 1 and one for pass 2,
        !           335: depending on what conditional compilation symbol is used.
        !           336: .PP
        !           337: Entire sections of this document are devoted to the detailed structure of the
        !           338: passes.
        !           339: For the moment, we just give a brief description of the files.
        !           340: The first pass
        !           341: is obtained by compiling and loading
        !           342: .I cgram.y ,
        !           343: .I code.c ,
        !           344: .I common.c ,
        !           345: .I local.c ,
        !           346: .I optim.c ,
        !           347: .I pftn.c ,
        !           348: .I scan.c ,
        !           349: .I stab.c ,
        !           350: .I trees.c
        !           351: and
        !           352: .I xdefs.c .
        !           353: .I Scan.c
        !           354: is the lexical analyzer, which provides tokens to the
        !           355: bottom-up parser which is defined by the Yacc grammar
        !           356: .I cgram.y .
        !           357: .I Xdefs.c
        !           358: is a short file of external definitions.
        !           359: .I Pftn.c
        !           360: maintains the symbol table, and does initialization.
        !           361: .I Trees.c
        !           362: builds the expression trees, and computes the node types.
        !           363: .I Optim.c
        !           364: does some machine independent optimizations on the expression trees.
        !           365: .I Common.c
        !           366: contains service routines common to the two passes of the compiler.
        !           367: All the above files are machine independent.
        !           368: The files
        !           369: .I local.c
        !           370: and
        !           371: .I code.c
        !           372: contain machine dependent code for generating subroutine
        !           373: prologs, switch code, and the like.
        !           374: .I Stab.c
        !           375: contains machine dependent code for
        !           376: producing external symbol table information
        !           377: which can drive a symbolic debugger.
        !           378: .PP
        !           379: The second pass
        !           380: is produced by compiling and loading
        !           381: .I allo.c ,
        !           382: .I common.c ,
        !           383: .I local2.c ,
        !           384: .I match.c ,
        !           385: .I order.c ,
        !           386: .I reader.c
        !           387: and
        !           388: .I table.c .
        !           389: .I Reader.c
        !           390: reads the intermediate file, and controls the major logic of the
        !           391: code generation.
        !           392: .I Allo.c
        !           393: keeps track of busy and free registers.
        !           394: .I Match.c
        !           395: controls the matching
        !           396: of code templates to subtrees of the expression tree to be compiled.
        !           397: .I Common.c
        !           398: defines certain service routines,
        !           399: as in the first pass.
        !           400: The above files are machine independent.
        !           401: .I Order.c
        !           402: controls the machine dependent details of the code generation strategy.
        !           403: .I Local2.c
        !           404: has many small machine dependent routines,
        !           405: and tables of opcodes, register types, etc.
        !           406: .I Table.c
        !           407: has the code template tables,
        !           408: which are also clearly machine dependent.
        !           409: .SH
        !           410: Data Structure Considerations
        !           411: .PP
        !           412: This section discusses the node numbers, type words, and expression trees, used
        !           413: throughout both passes of the compiler.
        !           414: .PP
        !           415: The file
        !           416: .I manifest.h
        !           417: defines those symbols used throughout both passes.
        !           418: The intent is to use the same symbol name (e.g., MINUS)
        !           419: for the given operator throughout the lexical analysis, parsing, tree building,
        !           420: and code generation phases.
        !           421: .I Manifest.h
        !           422: obtains some of its definitions from
        !           423: two other header files,
        !           424: .I localdefs.h
        !           425: and
        !           426: .I pcc.h .
        !           427: .I Localdefs.h
        !           428: contains definitions for operator symbols
        !           429: which are specific to the C compiler.
        !           430: .I Pcc.h
        !           431: contains definitions for operators and types
        !           432: which may be used by other compilers
        !           433: to communicate with a portable code generator
        !           434: based on pass 2;
        !           435: this code generator will be described later.
        !           436: .PP
        !           437: A token
        !           438: like MINUS may be seen in the lexical analyzer before it is known
        !           439: whether it is a unary or binary operator;
        !           440: clearly, it is necessary to know this by the time the parse tree
        !           441: is constructed.
        !           442: Thus, an operator (really a macro) called
        !           443: UNARY is provided, so that MINUS
        !           444: and UNARY MINUS are both distinct node numbers.
        !           445: Similarly, many binary operators exist in an assignment form (for example, \-=),
        !           446: and the operator ASG may be applied to such node names to generate new ones, e.g. ASG MINUS.
        !           447: .PP
        !           448: It is frequently desirable to know if a node represents a leaf (no descendants), a unary operator (one
        !           449: descendant) or a binary operator (two descendants).
        !           450: The macro
        !           451: .I optype(o)
        !           452: returns one of the manifest constants LTYPE, UTYPE, or BITYPE, respectively, depending
        !           453: on the node number
        !           454: .I o .
        !           455: Similarly,
        !           456: .I asgop(o)
        !           457: returns true if
        !           458: .I o
        !           459: is an assignment operator number (=, +=, etc. ), and
        !           460: .I logop(o)
        !           461: returns true if
        !           462: .I o
        !           463: is a relational or logical (&&, \(or\(or, or !) operator.
        !           464: .PP
        !           465: C has a rich typing structure, with a potentially infinite number of types.
        !           466: To begin with, there are the basic types: CHAR, SHORT, INT, LONG, the unsigned versions
        !           467: known as
        !           468: UCHAR, USHORT, UNSIGNED, ULONG, and FLOAT, DOUBLE,
        !           469: and finally
        !           470: STRTY (a structure), UNIONTY, and ENUMTY.
        !           471: Then, there are three operators that can be applied to types to make others:
        !           472: if
        !           473: .I t
        !           474: is a type, we may potentially have types
        !           475: .I "pointer to t" ,
        !           476: .I "function returning t" ,
        !           477: and
        !           478: .I "array of t's"
        !           479: generated from
        !           480: .I t .
        !           481: Thus, an arbitrary type in C consists of a basic type, and zero or more of these operators.
        !           482: .PP
        !           483: In the compiler, a type is represented by an unsigned integer; the rightmost four bits hold the basic
        !           484: type, and the remaining bits are divided into two-bit fields, containing
        !           485: 0 (no operator), or one of the three operators
        !           486: described above.
        !           487: The modifiers are read right to left in the word, starting with the two-bit
        !           488: field adjacent to the basic type, until a field with 0 in it is reached.
        !           489: The macros PTR, FTN, and ARY represent the
        !           490: .I "pointer to" ,
        !           491: .I "function returning" ,
        !           492: and
        !           493: .I "array of"
        !           494: operators.
        !           495: The macro values are shifted so that they align with the first two-bit
        !           496: field; thus PTR+INT
        !           497: represents the type for an integer pointer, and
        !           498: .DS
        !           499: ARY + (PTR<<2) + (FTN<<4) + DOUBLE
        !           500: .DE
        !           501: represents the type of an array of pointers to functions returning \fBdouble\fPs.
        !           502: .PP
        !           503: The type words are ordinarily manipulated by macros.
        !           504: If
        !           505: .I t
        !           506: is a type word,
        !           507: .I BTYPE(t)
        !           508: gives the basic type.
        !           509: .I ISPTR(t) ,
        !           510: .I ISARY(t) ,
        !           511: and
        !           512: .I ISFTN(t)
        !           513: ask if an object of this type is a pointer, array, or a function,
        !           514: respectively.
        !           515: .I MODTYPE(t,b)
        !           516: sets the basic type of
        !           517: .I t
        !           518: to
        !           519: .I b .
        !           520: .I DECREF(t)
        !           521: gives the type resulting from removing the first operator from
        !           522: .I t .
        !           523: Thus, if
        !           524: .I t
        !           525: is a pointer to
        !           526: .I t\(fm ,
        !           527: a function returning
        !           528: .I t\(fm ,
        !           529: or an array of
        !           530: .I t\(fm ,
        !           531: then
        !           532: .I DECREF(t)
        !           533: would equal
        !           534: .I t\(fm .
        !           535: .I INCREF(t)
        !           536: gives the type representing a pointer to
        !           537: .I t .
        !           538: Finally, there are operators for dealing with the unsigned types.
        !           539: .I ISUNSIGNED(t)
        !           540: returns true if
        !           541: .I t
        !           542: is one of the four basic unsigned types;
        !           543: in this case,
        !           544: .I DEUNSIGN(t)
        !           545: gives the associated `signed' type.
        !           546: Similarly,
        !           547: .I UNSIGNABLE(t)
        !           548: returns true if
        !           549: .I t
        !           550: is one of the four basic types that could become unsigned, and
        !           551: .I ENUNSIGN(t)
        !           552: returns the unsigned analogue of
        !           553: .I t
        !           554: in this case.
        !           555: .PP
        !           556: The other important global data structure is that of expression trees.
        !           557: The actual shapes of the nodes are given in
        !           558: .I ndu.h .
        !           559: The information stored for each pass is not quite the same;
        !           560: in the first pass, nodes contain
        !           561: dimension and size information, while in the second pass
        !           562: nodes contain register allocation information.
        !           563: Nevertheless, all nodes contain fields called
        !           564: .I op ,
        !           565: containing the node number,
        !           566: and
        !           567: .I type ,
        !           568: containing the type word.
        !           569: A function called
        !           570: .I talloc()
        !           571: returns a pointer to a new tree node.
        !           572: To free a node, its
        !           573: .I op
        !           574: field need merely be set to FREE.
        !           575: The other fields in the node will remain intact at least until the next allocation.
        !           576: .PP
        !           577: Nodes representing binary operators contain fields,
        !           578: .I left
        !           579: and
        !           580: .I right ,
        !           581: that contain pointers to the left and right descendants.
        !           582: Unary operator nodes have the
        !           583: .I left
        !           584: field, and a value field called
        !           585: .I rval .
        !           586: Leaf nodes, with no descendants, have two value fields:
        !           587: .I lval
        !           588: and
        !           589: .I rval .
        !           590: .PP
        !           591: At appropriate times, the function
        !           592: .I tcheck()
        !           593: can be called, to check that there are no busy nodes remaining.
        !           594: This is used as a compiler consistency check.
        !           595: The function
        !           596: .I tcopy(p)
        !           597: takes a pointer
        !           598: .I p
        !           599: that points to an expression tree, and returns a pointer
        !           600: to a disjoint copy of the tree.
        !           601: The function
        !           602: .I walkf(p,f)
        !           603: performs a postorder walk of the tree pointed to by
        !           604: .I p ,
        !           605: and applies the function
        !           606: .I f
        !           607: to each node.
        !           608: The function
        !           609: .I fwalk(p,f,d)
        !           610: does a preorder walk of the tree pointed to by
        !           611: .I p .
        !           612: At each node, it calls a function
        !           613: .I f ,
        !           614: passing to it the node pointer, a value passed down
        !           615: from its ancestor, and two pointers to values to be passed
        !           616: down to the left and right descendants (if any).
        !           617: The value
        !           618: .I d
        !           619: is the value passed down to the root.
        !           620: .I Fwalk
        !           621: is used for a number of tree labeling and debugging activities.
        !           622: .PP
        !           623: The other major data structure, the symbol table, exists only in pass one,
        !           624: and will be discussed later.
        !           625: .SH
        !           626: Pass One
        !           627: .PP
        !           628: The first pass does lexical analysis, parsing, symbol table maintenance,
        !           629: tree building, optimization, and a number of machine dependent things.
        !           630: This pass is largely machine independent, and the machine independent
        !           631: sections can be pretty successfully ignored.
        !           632: Thus, they will be only sketched here.
        !           633: .SH
        !           634: Lexical Analysis
        !           635: .PP
        !           636: The lexical analyzer is a conceptually simple routine that reads the input
        !           637: and returns the tokens of the C language as it encounters them:
        !           638: names, constants, operators, and keywords.
        !           639: The conceptual simplicity of this job is confounded a bit by several
        !           640: other simple jobs that unfortunately must go on simultaneously.
        !           641: These include
        !           642: .IP \(bu
        !           643: Keeping track of the current filename and line number, and occasionally
        !           644: setting this information as the result of preprocessor control lines.
        !           645: .IP \(bu
        !           646: Skipping comments.
        !           647: .IP \(bu
        !           648: Properly dealing with octal, decimal, hex, floating
        !           649: point, and character constants, as well as character strings.
        !           650: .PP
        !           651: To achieve speed, the program maintains several tables
        !           652: that are indexed into by character value, to tell
        !           653: the lexical analyzer what to do next.
        !           654: To achieve portability, these tables must be initialized
        !           655: each time the compiler is run, in order that the
        !           656: table entries reflect the local character set values.
        !           657: .SH
        !           658: Parsing
        !           659: .PP
        !           660: As mentioned above, the parser is generated by Yacc from the grammar
        !           661: .I cgram.y.
        !           662: The grammar is relatively readable, but contains some unusual features
        !           663: that are worth comment.
        !           664: .PP
        !           665: Perhaps the strangest feature of the grammar is the treatment of
        !           666: declarations.
        !           667: The problem is to keep track of the basic type
        !           668: and the storage class while interpreting the various
        !           669: stars, brackets, and parentheses that may surround a given name.
        !           670: The entire declaration mechanism must be recursive,
        !           671: since declarations may appear within declarations of
        !           672: structures and unions, or even within a
        !           673: .B sizeof
        !           674: construction
        !           675: inside a dimension in another declaration!
        !           676: .PP
        !           677: There are some difficulties in using a bottom-up parser,
        !           678: such as produced by Yacc, to handle constructions where a lot
        !           679: of left context information must be kept around.
        !           680: The problem is that the original PDP-11 compiler is top-down in
        !           681: implementation, and some of the semantics of C reflect this.
        !           682: In a top-down parser, the input rules are restricted
        !           683: somewhat, but one can naturally associate temporary
        !           684: storage with a rule at a very early stage in the recognition
        !           685: of that rule.
        !           686: In a bottom-up parser, there is more freedom in the
        !           687: specification of rules, but it is more difficult to know what
        !           688: rule is being matched until the entire rule is seen.
        !           689: The parser described by
        !           690: .I cgram.y
        !           691: makes effective use of
        !           692: the bottom-up parsing mechanism in some places (notably the treatment
        !           693: of expressions), but struggles against the
        !           694: restrictions in others.
        !           695: The usual result is that it is necessary to run a stack of values
        !           696: ``on the side'', independent of the Yacc value stack,
        !           697: in order to be able to store and access information
        !           698: deep within inner constructions, where the relationship of the
        !           699: rules being recognized to the total picture is not yet clear.
        !           700: .PP
        !           701: In the case of declarations, the attribute information
        !           702: (type, etc.) for a declaration is carefully
        !           703: kept immediately to the left of the declarator
        !           704: (that part of the declaration involving the name).
        !           705: In this way, when it is time to declare the name, the
        !           706: name and the type information can be quickly brought together.
        !           707: The ``$0'' mechanism of Yacc is used to accomplish this.
        !           708: The result is not pretty, but it works.
        !           709: The storage class information changes more slowly,
        !           710: so it is kept in an external variable, and stacked if
        !           711: necessary.
        !           712: Some of the grammar could be considerably cleaned up by using
        !           713: some more recent features of Yacc, notably actions within
        !           714: rules and the ability to return multiple values for actions.
        !           715: .PP
        !           716: A stack is also used to keep track of the current location
        !           717: to be branched to when a
        !           718: .B break
        !           719: or
        !           720: .B continue
        !           721: statement is processed.
        !           722: .PP
        !           723: This use of external stacks dates from the time when Yacc did not permit
        !           724: values to be structures.
        !           725: Some, or most, of this use of external stacks
        !           726: could be eliminated by redoing the grammar to use the mechanisms now provided.
        !           727: There are some areas, however, particularly the processing of structure, union,
        !           728: and enumeration declarations, function
        !           729: prologs, and switch statement processing, when having
        !           730: all the affected data together in an array speeds later
        !           731: processing; in this case, use of external storage seems essential.
        !           732: .PP
        !           733: The
        !           734: .I cgram.y
        !           735: file also contains some small functions used as
        !           736: utility functions in the parser.
        !           737: These include routines for saving case values and labels in processing switches,
        !           738: and stacking and popping 
        !           739: values on the external stack described above.
        !           740: .SH
        !           741: Storage Classes
        !           742: .PP
        !           743: C has a finite, but fairly extensive, number of storage classes
        !           744: available.
        !           745: One of the compiler design decisions was to
        !           746: process the storage class information totally in the first pass;
        !           747: by the second pass, this information must have
        !           748: been totally dealt with.
        !           749: This means that all of the storage allocation must take place in the first
        !           750: pass, so that references to automatics and
        !           751: parameters can be turned into references to cells lying a certain
        !           752: number of bytes offset from certain machine registers.
        !           753: Much of this transformation is machine dependent, and strongly
        !           754: depends on the storage class.
        !           755: .PP
        !           756: The classes include EXTERN (for externally declared, but not defined variables),
        !           757: EXTDEF (for external definitions), and similar distinctions for
        !           758: USTATIC and STATIC, UFORTRAN and FORTRAN (for fortran functions) and
        !           759: ULABEL and LABEL.
        !           760: The storage classes REGISTER and AUTO are obvious, as
        !           761: are STNAME, UNAME, and ENAME (for structure, union, and enumeration
        !           762: tags), and the associated MOS, MOU, and MOE (for the members).
        !           763: TYPEDEF is treated as a storage class as well.
        !           764: There are two special storage classes: PARAM and SNULL.
        !           765: SNULL is used to distinguish the case where no explicit
        !           766: storage class has been given; before an entry is made
        !           767: in the symbol table the true storage class is discovered.
        !           768: Similarly, PARAM is used for the temporary entry in the symbol
        !           769: table made before the declaration of function parameters is completed.
        !           770: .PP
        !           771: The most complexity in the storage class process comes from
        !           772: bit fields.
        !           773: A separate storage class is kept for each width bit field;
        !           774: a
        !           775: .I k
        !           776: bit bit field has storage class
        !           777: .I k
        !           778: plus FIELD.
        !           779: This enables the size to be quickly recovered from the storage class.
        !           780: .SH
        !           781: Symbol Table Maintenance
        !           782: .PP
        !           783: The symbol table routines do far more than simply enter
        !           784: names into the symbol table;
        !           785: considerable semantic processing and checking is done as well.
        !           786: For example, if a new declaration comes in, it must be checked
        !           787: to see if there is a previous declaration of the same symbol.
        !           788: If there is, there are many cases.
        !           789: The declarations may agree and be compatible (for example,
        !           790: an \fBextern\fP declaration can appear twice) in which case the
        !           791: new declaration is ignored.
        !           792: The new declaration may add information (such as an explicit array
        !           793: dimension) to an already present declaration.
        !           794: The new declaration may be different, but still correct (for example,
        !           795: an \fBextern\fP declaration of something may be entered,
        !           796: and then later the definition may be seen).
        !           797: The new declaration may be incompatible, but appear in an inner block;
        !           798: in this case, the old declaration is carefully hidden away, and
        !           799: the new one comes into force until the block is left.
        !           800: Finally, the declarations may be incompatible, and an error
        !           801: message must be produced.
        !           802: .PP
        !           803: A number of other factors make for additional complexity.
        !           804: The type declared by the user is not always the type
        !           805: entered into the symbol table (for example, if a formal parameter to a function
        !           806: is declared to be an array, C requires that this be changed into
        !           807: a pointer before entry in the symbol table).
        !           808: Moreover, there are various kinds of illegal types that
        !           809: may be declared which are difficult to
        !           810: check for syntactically (for example,
        !           811: a function returning an array).
        !           812: Finally, there is a strange feature in C that requires
        !           813: structure tag names and member names for structures and unions
        !           814: to be taken from a different logical symbol table than ordinary identifiers.
        !           815: Keeping track of which kind of name is involved is a bit of struggle
        !           816: (consider typedef names used within structure declarations, for example).
        !           817: .PP
        !           818: The symbol table handling routines have been rewritten a number of times to
        !           819: extend features, improve performance, and fix bugs.
        !           820: They address the above problems with reasonable effectiveness
        !           821: but a singular lack of grace.
        !           822: .PP
        !           823: When a name is read in the input, it is hashed, and the
        !           824: routine
        !           825: .I lookup
        !           826: is called, together with a flag which tells which symbol table
        !           827: should be searched (actually, both symbol tables are stored in one, and a flag
        !           828: is used to distinguish individual entries).
        !           829: If the name is found,
        !           830: .I lookup
        !           831: returns the index to the entry found; otherwise, it makes
        !           832: a new entry, marks it UNDEF (undefined), and returns the
        !           833: index of the new entry.
        !           834: This index is stored in the
        !           835: .I rval
        !           836: field of a NAME node.
        !           837: .PP
        !           838: When a declaration is being parsed, this NAME node is
        !           839: made part of a tree with UNARY MUL nodes for each \fB*\fP,
        !           840: LB nodes for each array descriptor (the right descendant
        !           841: has the dimension), and UNARY CALL nodes for each function
        !           842: descriptor.
        !           843: This tree is passed to the routine
        !           844: .I tymerge ,
        !           845: along with the attribute type of the whole declaration;
        !           846: this routine collapses the tree to a single node, by calling
        !           847: .I tyreduce ,
        !           848: and then modifies the type to reflect the overall
        !           849: type of the declaration.
        !           850: .PP
        !           851: Dimension and size information is stored in a table called
        !           852: .I dimtab .
        !           853: To properly describe a type in C, one needs not just the
        !           854: type information but also size information (for structures and
        !           855: enumerations) and dimension information (for arrays).
        !           856: Sizes and offsets are dealt with in the compiler by
        !           857: giving the associated indices into
        !           858: .I dimtab .
        !           859: .I Tymerge
        !           860: and
        !           861: .I tyreduce
        !           862: call
        !           863: .I dstash
        !           864: to put the discovered dimensions away into the
        !           865: .I dimtab
        !           866: array.
        !           867: .I Tymerge
        !           868: returns a pointer to a single node that contains
        !           869: the symbol table index in its
        !           870: .I rval
        !           871: field, and the size and dimension indices in
        !           872: fields
        !           873: .I csiz
        !           874: and
        !           875: .I cdim ,
        !           876: respectively.
        !           877: This information is properly considered part of the type in the first pass,
        !           878: and is carried around at all times.
        !           879: .PP
        !           880: To enter an element into the symbol table, the routine
        !           881: .I defid
        !           882: is called; it is handed a storage class, and a pointer
        !           883: to the node produced by
        !           884: .I tymerge .
        !           885: .I Defid
        !           886: calls
        !           887: .I fixtype ,
        !           888: which adjusts and checks the given type depending on the storage
        !           889: class, and converts null types appropriately.
        !           890: It then calls
        !           891: .I fixclass ,
        !           892: which does a similar job for the storage class;
        !           893: it is here, for example, that
        !           894: \fBregister\fP declarations are either allowed or changed
        !           895: to \fBauto\fP.
        !           896: .PP
        !           897: The new declaration is now compared against an older one,
        !           898: if present, and several pages of validity checks performed.
        !           899: If the definitions are compatible, with possibly some added information,
        !           900: the processing is straightforward.
        !           901: If the definitions differ, the block levels of the
        !           902: current and the old declaration are compared.
        !           903: The current block level is kept in
        !           904: .I blevel ,
        !           905: an external variable; the old declaration level is kept in the symbol table.
        !           906: Block level 0 is for external declarations, 1 is for arguments to functions,
        !           907: and 2 and above are blocks within a function.
        !           908: If the current block level is the same as the old declaration, an error
        !           909: results.
        !           910: If the current block level is higher, the new declaration overrides the old.
        !           911: This is done by marking the old symbol table entry ``hidden'', and making
        !           912: a new entry, marked ``hiding''.
        !           913: .I Lookup
        !           914: will skip over hidden entries.
        !           915: When a block is left, the symbol table is searched,
        !           916: and any entries defined in that block are destroyed; if
        !           917: they hid other entries, the old entries are ``unhidden''.
        !           918: .PP
        !           919: This nice block structure is warped a bit because labels
        !           920: do not follow the block structure rules (one can do a
        !           921: .B goto
        !           922: into
        !           923: a block, for example);
        !           924: default definitions of functions in inner blocks also persist clear out to the outermost scope.
        !           925: This implies that cleaning up the symbol table after block exit is more
        !           926: subtle than it might first seem.
        !           927: .PP
        !           928: For successful new definitions,
        !           929: .I defid
        !           930: also initializes a ``general purpose'' field,
        !           931: .I offset ,
        !           932: in the symbol table.
        !           933: It contains the stack offset for automatics and parameters, the register number
        !           934: for register variables, the bit offset into the structure for
        !           935: structure members, and
        !           936: the internal label number for static variables and labels.
        !           937: The offset field is set by
        !           938: .I falloc
        !           939: for bit fields, and
        !           940: .I dclstruct
        !           941: for structures and unions.
        !           942: .PP
        !           943: The symbol table entry itself thus contains
        !           944: the name, type word, size and dimension offsets,
        !           945: offset value, and declaration block level.
        !           946: It also has a field of flags, describing what symbol table the
        !           947: name is in, and whether the entry is hidden, or hides another.
        !           948: Finally, a field gives the line number of the last use,
        !           949: or of the definition, of the name.
        !           950: This is used mainly for diagnostics, but is useful to
        !           951: .I lint
        !           952: as well.
        !           953: .PP
        !           954: In some special cases, there is more than the above amount of information kept
        !           955: for the use of the compiler.
        !           956: This is especially true with structures; for use in initialization,
        !           957: structure declarations must have access to a list of the members of the
        !           958: structure.
        !           959: This list is also kept in
        !           960: .I dimtab .
        !           961: Because a structure can be mentioned long before the
        !           962: members are known, it is necessary to have another
        !           963: level of indirection in the table.
        !           964: The two words following the
        !           965: .I csiz
        !           966: entry in
        !           967: .I dimtab
        !           968: are used to hold the alignment of the structure, and
        !           969: the index in dimtab of the list of members.
        !           970: This list contains the symbol table indices for the structure members, terminated by a
        !           971: \-1.
        !           972: .SH
        !           973: Tree Building
        !           974: .PP
        !           975: The portable compiler transforms expressions
        !           976: into expression trees.
        !           977: As the parser recognizes each rule making up an
        !           978: expression,
        !           979: it calls
        !           980: .I buildtree
        !           981: which is given an operator number, and pointers to the
        !           982: left and right descendants.
        !           983: .I Buildtree
        !           984: first examines the left and right descendants,
        !           985: and, if they are both constants, and the operator
        !           986: is appropriate, simply does the constant
        !           987: computation at compile time, and returns
        !           988: the result as a constant.
        !           989: Otherwise,
        !           990: .I buildtree
        !           991: allocates a node for the head of the tree,
        !           992: attaches the descendants to it, and ensures that
        !           993: conversion operators are generated if needed, and that
        !           994: the type of the new node is consistent with the types
        !           995: of the operands.
        !           996: There is also a considerable amount of semantic complexity here;
        !           997: many combinations of types are illegal, and the portable
        !           998: compiler makes a strong effort to check
        !           999: the legality of expression types completely.
        !          1000: This is done both for
        !          1001: .I lint
        !          1002: purposes, and to prevent such semantic errors
        !          1003: from being passed through to the code generator.
        !          1004: .PP
        !          1005: The heart of
        !          1006: .I buildtree
        !          1007: is a large table, accessed by the routine
        !          1008: .I opact .
        !          1009: This routine maps the types of the left and right
        !          1010: operands into a rather smaller set of descriptors, and then
        !          1011: accesses a table (actually encoded in a switch statement) which
        !          1012: for each operator and pair of types causes
        !          1013: an action to be returned.
        !          1014: The actions are logical or's of a number of
        !          1015: separate actions, which may be
        !          1016: carried out by
        !          1017: .I buildtree .
        !          1018: These component actions may include
        !          1019: checking the left side to ensure that it
        !          1020: is an lvalue (can be stored into),
        !          1021: applying a type conversion to the left or right operand,
        !          1022: setting the type of the new node to
        !          1023: the type of the left or right operand, calling various
        !          1024: routines to balance the types of the left and right operands,
        !          1025: and suppressing the ordinary conversion
        !          1026: of arrays and function operands to pointers.
        !          1027: An important operation is OTHER, which causes
        !          1028: some special code to be invoked
        !          1029: in
        !          1030: .I buildtree ,
        !          1031: to handle issues which are unique to a particular operator.
        !          1032: Examples of this are
        !          1033: structure and union reference
        !          1034: (actually handled by
        !          1035: the routine
        !          1036: .I stref ),
        !          1037: the building of NAME, ICON, STRING and FCON (floating point constant) nodes,
        !          1038: unary \fB*\fP and \fB&\fP, structure assignment, and calls.
        !          1039: In the case of unary \fB*\fP and \fB&\fP,
        !          1040: .I buildtree
        !          1041: will cancel a \fB*\fP applied to a tree, the top node of which
        !          1042: is \fB&\fP, and conversely.
        !          1043: .PP
        !          1044: Another special operation is
        !          1045: PUN; this
        !          1046: causes the compiler to check for type mismatches,
        !          1047: such as intermixing pointers and integers.
        !          1048: .PP
        !          1049: The treatment of conversion operators is a rather
        !          1050: strange area of the compiler (and of C!).
        !          1051: The introduction of type casts only confounded this
        !          1052: situation.
        !          1053: Most of the conversion operators are generated by
        !          1054: calls to
        !          1055: .I tymatch
        !          1056: and
        !          1057: .I ptmatch ,
        !          1058: both of which are given a tree, and asked to make the
        !          1059: operands agree in type.
        !          1060: .I Ptmatch
        !          1061: treats the case where one of the operands is a pointer;
        !          1062: .I tymatch
        !          1063: treats all other cases.
        !          1064: Where these routines have decided on the
        !          1065: proper type for an operand, they call
        !          1066: .I makety ,
        !          1067: which is handed a tree, and a type word, dimension offset, and size offset.
        !          1068: If necessary, it inserts a conversion operation to make the
        !          1069: types correct.
        !          1070: Conversion operations are never inserted on the left side of
        !          1071: assignment operators, however.
        !          1072: There are two conversion operators used;
        !          1073: PCONV, if the conversion is to a non-basic type (usually a pointer),
        !          1074: and
        !          1075: SCONV, if the conversion is to a basic type (scalar).
        !          1076: .PP
        !          1077: To allow for maximum flexibility, every node produced by
        !          1078: .I buildtree
        !          1079: is given to a machine dependent routine,
        !          1080: .I clocal ,
        !          1081: immediately after it is produced.
        !          1082: This is to allow more or less immediate rewriting of those
        !          1083: nodes which must be adapted for the local machine.
        !          1084: The conversion operations are given to
        !          1085: .I clocal
        !          1086: as well; on most machines, many of these
        !          1087: conversions do nothing, and should be thrown away
        !          1088: (being careful to retain the type).
        !          1089: If this operation is done too early,
        !          1090: however,
        !          1091: later calls to
        !          1092: .I buildtree
        !          1093: may get confused about correct type of the
        !          1094: subtrees;
        !          1095: thus
        !          1096: .I clocal
        !          1097: is given the conversion operations only after the entire tree is built.
        !          1098: This topic will be dealt with in more detail later.
        !          1099: .SH
        !          1100: Initialization
        !          1101: .PP
        !          1102: Initialization is one of the messier areas in the portable compiler.
        !          1103: The only consolation is that most of the mess takes place
        !          1104: in the machine independent part, where it
        !          1105: is may be safely ignored by the implementor of the
        !          1106: compiler for a particular machine.
        !          1107: .PP
        !          1108: The basic problem is that the semantics of initialization
        !          1109: really calls for a co-routine structure; one collection
        !          1110: of programs reading constants from the input stream, while another,
        !          1111: independent set of programs places these constants into the
        !          1112: appropriate spots in memory.
        !          1113: The dramatic differences in the local assemblers also
        !          1114: come to the fore here.
        !          1115: The parsing problems are dealt with by keeping a rather
        !          1116: extensive stack containing the current
        !          1117: state of the initialization; the assembler
        !          1118: problems are dealt with by having a fair number of machine dependent routines.
        !          1119: .PP
        !          1120: The stack contains the symbol table number,
        !          1121: type, dimension index, and size index for the current identifier
        !          1122: being initialized.
        !          1123: Another entry has the offset, in bits, of the beginning
        !          1124: of the current identifier.
        !          1125: Another entry keeps track of how many elements have been seen,
        !          1126: if the current identifier is an array.
        !          1127: Still another entry keeps track of the current
        !          1128: member of a structure being initialized.
        !          1129: Finally, there is an entry containing flags
        !          1130: which keep track of the current state of the initialization process
        !          1131: (e.g., tell if a `}' has been seen for the current identifier).
        !          1132: .PP
        !          1133: When an initialization begins, the routine
        !          1134: .I beginit
        !          1135: is called; it handles the alignment restrictions, if
        !          1136: any, and calls
        !          1137: .I instk
        !          1138: to create the stack entry.
        !          1139: This is done by first making an entry on the top of the stack for the item
        !          1140: being initialized.
        !          1141: If the top entry is an array, another entry is made on the stack
        !          1142: for the first element.
        !          1143: If the top entry is a structure, another entry is made on the
        !          1144: stack for the first member of the structure.
        !          1145: This continues until the top element of the stack is a scalar.
        !          1146: .I Instk
        !          1147: then returns, and the parser begins collecting initializers.
        !          1148: .PP
        !          1149: When a constant is obtained, the routine
        !          1150: .I doinit
        !          1151: is called; it examines the stack, and does whatever
        !          1152: is necessary to assign the current constant to the
        !          1153: scalar on the top of the stack.
        !          1154: .I gotscal
        !          1155: is then called, which rearranges the stack so that the
        !          1156: next scalar to be initialized gets placed on top of the stack.
        !          1157: This process continues until the end of the initializers;
        !          1158: .I endinit
        !          1159: cleans up.
        !          1160: If a `{' or `}' is encountered in the
        !          1161: string of initializers, it is handled by calling
        !          1162: .I ilbrace
        !          1163: or
        !          1164: .I irbrace ,
        !          1165: respectively.
        !          1166: .PP
        !          1167: A central issue is the treatment of the ``holes'' that
        !          1168: arise as a result of alignment restrictions or explicit
        !          1169: requests for holes in bit fields.
        !          1170: There is a global variable,
        !          1171: .I inoff ,
        !          1172: which contains the current offset in the initialization
        !          1173: (all offsets in the first pass of the compiler are in bits).
        !          1174: .I Doinit
        !          1175: figures out from the top entry on the stack the expected
        !          1176: bit offset of the next identifier; it calls the
        !          1177: machine dependent
        !          1178: routine
        !          1179: .I inforce
        !          1180: which, in a machine dependent way, forces
        !          1181: the assembler to
        !          1182: set aside space if need be so that the
        !          1183: next scalar seen will go into the appropriate
        !          1184: bit offset position.
        !          1185: The scalar itself is passed to one of the machine dependent
        !          1186: routines
        !          1187: .I fincode 
        !          1188: (for floating point initialization),
        !          1189: .I incode
        !          1190: (for fields, and other initializations less than an int in
        !          1191: size),
        !          1192: and
        !          1193: .I cinit
        !          1194: (for all other initializations).
        !          1195: The size is passed to all these routines, and it is up to
        !          1196: the machine dependent routines to ensure that the
        !          1197: initializer occupies exactly the right size.
        !          1198: .PP
        !          1199: Character strings represent a bit of an exception.
        !          1200: If a character string is seen as the initializer for
        !          1201: a pointer, the characters making up the string must
        !          1202: be put out under a different location counter.
        !          1203: When the lexical analyzer sees the quote at the head
        !          1204: of a character string, it returns the token STRING,
        !          1205: but does not do anything with the contents.
        !          1206: The parser calls
        !          1207: .I getstr ,
        !          1208: which sets up the appropriate location counters
        !          1209: and flags, and calls
        !          1210: .I lxstr
        !          1211: to read and process the contents of the string.
        !          1212: .PP
        !          1213: If the string is being used to initialize a character array,
        !          1214: .I lxstr
        !          1215: calls
        !          1216: .I putbyte ,
        !          1217: which in effect simulates
        !          1218: .I doinit
        !          1219: for each character read.
        !          1220: If the string is used to initialize a character pointer,
        !          1221: .I lxstr
        !          1222: calls a machine dependent routine,
        !          1223: .I bycode ,
        !          1224: which stashes away each character.
        !          1225: The pointer to this string is then returned,
        !          1226: and processed normally by
        !          1227: .I doinit .
        !          1228: .PP
        !          1229: The null at the end of the string is treated as if it
        !          1230: were read explicitly by
        !          1231: .I lxstr .
        !          1232: .SH
        !          1233: Statements
        !          1234: .PP
        !          1235: The first pass addresses four main areas; declarations, expressions, initialization, and statements.
        !          1236: The statement processing is relatively simple; most of it is carried out in the
        !          1237: parser directly.
        !          1238: Most of the logic is concerned with allocating
        !          1239: label numbers, defining the labels, and branching appropriately.
        !          1240: An external symbol,
        !          1241: .I reached ,
        !          1242: is 1 if a statement can be reached, 0 otherwise; this is
        !          1243: used to do a bit of simple flow analysis as the program is being parsed,
        !          1244: and also to avoid generating the subroutine return sequence if the subroutine
        !          1245: cannot ``fall through'' the last statement.
        !          1246: .PP
        !          1247: Conditional branches are handled by generating an expression
        !          1248: node, CBRANCH,
        !          1249: whose left descendant is the conditional expression and the
        !          1250: right descendant is an ICON node containing the internal label
        !          1251: number to be branched to.
        !          1252: For efficiency, the semantics are that
        !          1253: the label is gone to if the condition is
        !          1254: .I false .
        !          1255: .PP
        !          1256: The switch statement is 
        !          1257: compiled by collecting the case entries, and an indication as to whether
        !          1258: there is a default case;
        !          1259: an internal label number is generated for each of these,
        !          1260: and remembered in a big array.
        !          1261: The expression comprising the value to be switched on is
        !          1262: compiled when the switch keyword is encountered,
        !          1263: but the expression tree is headed by
        !          1264: a special node, FORCE, which tells the code generator to
        !          1265: put the expression value into a special distinguished
        !          1266: register (this same mechanism is used for processing the
        !          1267: return statement).
        !          1268: When the end of the switch block is reached, the array
        !          1269: containing the case values is sorted, and checked for
        !          1270: duplicate entries (an error); if all is
        !          1271: correct, the machine dependent routine
        !          1272: .I genswitch
        !          1273: is called, with this array of labels and values in increasing order.
        !          1274: .I Genswitch
        !          1275: can assume that the value to be tested is already in the
        !          1276: register which is the usual integer return value register.
        !          1277: .SH
        !          1278: Optimization
        !          1279: .PP
        !          1280: There is a machine independent file,
        !          1281: .I optim.c ,
        !          1282: which contains a relatively short optimization routine,
        !          1283: .I optim .
        !          1284: Actually the word optimization is something of a misnomer;
        !          1285: the results are not optimum, only improved, and the
        !          1286: routine is in fact not optional; it must
        !          1287: be called for proper operation of the compiler.
        !          1288: .PP
        !          1289: .I Optim
        !          1290: is called after an expression tree is built, but
        !          1291: before the code generator is called.
        !          1292: The essential part of its job is to call
        !          1293: .I clocal
        !          1294: on the conversion operators.
        !          1295: On most machines, the treatment of
        !          1296: \fB&\fP is also essential:
        !          1297: by this time in the processing, the only node which
        !          1298: is a legal descendant of \fB&\fP is NAME.
        !          1299: (Possible descendants of \fB*\fP have been eliminated by
        !          1300: .I buildtree .)
        !          1301: The address of a static name is, almost by definition, a
        !          1302: constant, and can be represented by an ICON node on most machines
        !          1303: (provided that the loader has enough power).
        !          1304: Unfortunately, this is not universally true; on some machine, such as the IBM 370,
        !          1305: the issue of addressability rears its ugly head;
        !          1306: thus, before turning a NAME node into an ICON node,
        !          1307: the machine dependent function
        !          1308: .I andable
        !          1309: is called.
        !          1310: .PP
        !          1311: The optimization attempts of
        !          1312: .I optim
        !          1313: are quite limited.
        !          1314: It is primarily concerned with improving the behavior of
        !          1315: the compiler with operations one of whose arguments is a constant.
        !          1316: In the simplest case, the constant is placed on the right if the
        !          1317: operation is commutative.
        !          1318: The compiler also makes a limited search for expressions
        !          1319: such as
        !          1320: .DS
        !          1321: ( \fIx\fP + \fIa\fP ) + \fIb\fP
        !          1322: .DE
        !          1323: where
        !          1324: .I a
        !          1325: and
        !          1326: .I b
        !          1327: are constants, and attempts to combine
        !          1328: .I a
        !          1329: and
        !          1330: .I b
        !          1331: at compile time.
        !          1332: A number of special cases are also examined;
        !          1333: additions of 0 and multiplications by 1 are removed,
        !          1334: although the correct processing of these cases to get
        !          1335: the type of the resulting tree correct is
        !          1336: decidedly nontrivial.
        !          1337: In some cases, the addition or multiplication must be replaced by
        !          1338: a conversion operator to keep the types from becoming
        !          1339: fouled up.
        !          1340: In cases where a relational operation is being done
        !          1341: and one operand is a constant, the operands are permuted and the operator altered, if necessary,
        !          1342: to put the constant on the right.
        !          1343: Finally, multiplications by a power of 2 are changed to shifts.
        !          1344: .SH
        !          1345: Machine Dependent Stuff
        !          1346: .PP
        !          1347: A number of the first pass machine dependent routines have been discussed above.
        !          1348: In general, the routines are short, and easy to adapt from
        !          1349: machine to machine.
        !          1350: The two exceptions to this general rule are
        !          1351: .I clocal
        !          1352: and
        !          1353: the function prolog and epilog generation routines,
        !          1354: .I bfcode
        !          1355: and
        !          1356: .I efcode .
        !          1357: .PP
        !          1358: .I Clocal
        !          1359: has the job of rewriting, if appropriate and desirable,
        !          1360: the nodes constructed by
        !          1361: .I buildtree .
        !          1362: There are two major areas where this
        !          1363: is important:
        !          1364: NAME nodes and conversion operations.
        !          1365: In the case of NAME nodes,
        !          1366: .I clocal
        !          1367: must rewrite the NAME node to reflect the
        !          1368: actual physical location of the name in the machine.
        !          1369: In effect, the NAME node must be examined, the symbol table
        !          1370: entry found (through the
        !          1371: .I rval
        !          1372: field of the node),
        !          1373: and, based on the storage class of the node,
        !          1374: the tree must be rewritten.
        !          1375: Automatic variables and parameters are typically
        !          1376: rewritten by treating the reference to the variable as
        !          1377: a structure reference, off the register which
        !          1378: holds the stack or argument pointer;
        !          1379: the
        !          1380: .I stref
        !          1381: routine is set up to be called in this way, and to
        !          1382: build the appropriate tree.
        !          1383: In the most general case, the tree consists
        !          1384: of a unary \fB*\fP node, whose descendant is
        !          1385: a \fB+\fP node, with the stack or argument register as left operand,
        !          1386: and a constant offset as right operand.
        !          1387: In the case of LABEL and internal static nodes, the
        !          1388: .I rval
        !          1389: field is rewritten to be the negative of the internal
        !          1390: label number; a negative
        !          1391: .I rval 
        !          1392: field is taken to be an internal label number.
        !          1393: Finally, a name of class REGISTER must be converted into a REG node,
        !          1394: and the
        !          1395: .I rval
        !          1396: field replaced by the register number.
        !          1397: In fact, this part of the
        !          1398: .I clocal
        !          1399: routine is nearly machine independent; only for machines
        !          1400: with addressability problems (IBM 370 again!) does it
        !          1401: have to be noticeably different.
        !          1402: .PP
        !          1403: The conversion operator treatment is rather tricky.
        !          1404: It is necessary to handle the application of conversion operators
        !          1405: to constants in
        !          1406: .I clocal ,
        !          1407: in order that all constant expressions can have their values known
        !          1408: at compile time.
        !          1409: In extreme cases, this may mean that some simulation of the
        !          1410: arithmetic of the target machine might have to be done in a
        !          1411: cross-compiler.
        !          1412: In the most common case,
        !          1413: conversions from pointer to pointer do nothing.
        !          1414: For some machines, however, conversion from byte pointer to short or long
        !          1415: pointer might require a shift or rotate operation, which would
        !          1416: have to be generated here.
        !          1417: .PP
        !          1418: The extension of the portable compiler to machines where the size of a pointer
        !          1419: depends on its type would be straightforward, but has not yet been done.
        !          1420: .PP
        !          1421: Another machine dependent issue in the first pass
        !          1422: is the generation of external ``symbol table'' information.
        !          1423: This sort of symbol table is used by programs such as symbolic debuggers
        !          1424: to relate object code back to source code.
        !          1425: Symbol table routines are provided in
        !          1426: the file \fIstab.c\fP,
        !          1427: which is included in the machine dependent sources
        !          1428: for the first pass.
        !          1429: The symbol table routines insert assembly code
        !          1430: containing assembly pseudo-ops directly into
        !          1431: the instruction stream generated by the compiler.
        !          1432: .PP
        !          1433: There are two basic kinds of symbol table operations.
        !          1434: The simplest operation is the generation of a source line number;
        !          1435: this serves to map an address in an executable image
        !          1436: into a line in a source file so that a debugger
        !          1437: can find the source code
        !          1438: corresponding to the instructions being executed.
        !          1439: The routine \fIpsline\fP is called by the scanner
        !          1440: to emit source line numbers when
        !          1441: a nonempty source line is seen.
        !          1442: The other variety of symbol table operation
        !          1443: is the generation of type and address information
        !          1444: about C symbols.
        !          1445: This is done through the \fIoutstab\fP routine,
        !          1446: which is normally called using the FIXDEF macro
        !          1447: in the monster \fIdefid\fP routine in \fIpftn.c\fP
        !          1448: that enters symbols into the compiler's internal symbol table.
        !          1449: .PP
        !          1450: Yet another major machine dependent issue involves function prolog and epilog
        !          1451: generation.
        !          1452: The hard part here is the design of the stack frame
        !          1453: and calling sequence; this design issue is discussed elsewhere.
        !          1454: .[
        !          1455: Johnson Lesk Ritchie calling sequence
        !          1456: .]
        !          1457: The routine
        !          1458: .I bfcode
        !          1459: is called with the number of arguments
        !          1460: the function is defined with, and
        !          1461: an array containing the symbol table indices of the
        !          1462: declared parameters.
        !          1463: .I Bfcode
        !          1464: must generate the code to establish the new stack frame,
        !          1465: save the return address and previous stack pointer
        !          1466: value on the stack, and save whatever
        !          1467: registers are to be used for register variables.
        !          1468: The stack size and the number of register variables is not
        !          1469: known when
        !          1470: .I bfcode
        !          1471: is called, so these numbers must be
        !          1472: referred to by assembler constants, which are
        !          1473: defined when they are known (usually in the second pass,
        !          1474: after all register variables, automatics, and temporaries have been seen).
        !          1475: The final job is to find those parameters which may have been declared
        !          1476: register, and generate the code to initialize
        !          1477: the register with the value passed on the stack.
        !          1478: Once again, for most machines, the general logic of
        !          1479: .I bfcode
        !          1480: remains the same, but the contents of the
        !          1481: .I printf
        !          1482: calls in it will change from machine to machine.
        !          1483: .I efcode
        !          1484: is rather simpler, having just to generate the default
        !          1485: return at the end of a function.
        !          1486: This may be nontrivial in the case of a function returning a structure or union, however.
        !          1487: .PP
        !          1488: There seems to be no really good place to discuss structures and unions, but
        !          1489: this is as good a place as any.
        !          1490: The C language now supports structure assignment,
        !          1491: and the passing of structures as arguments to functions,
        !          1492: and the receiving of structures back from functions.
        !          1493: This was added rather late to C, and thus to the portable compiler.
        !          1494: Consequently, it fits in less well than the older features.
        !          1495: Moreover, most of the burden of making these features work is
        !          1496: placed on the machine dependent code.
        !          1497: .PP
        !          1498: There are both conceptual and practical problems.
        !          1499: Conceptually, the compiler is structured around
        !          1500: the idea that to compute something, you put it into
        !          1501: a register and work on it.
        !          1502: This notion causes a bit of trouble on some machines (e.g., machines with 3-address opcodes), but
        !          1503: matches many machines quite well.
        !          1504: Unfortunately, this notion breaks down with structures.
        !          1505: The closest that one can come is to keep the addresses of the
        !          1506: structures in registers.
        !          1507: The actual code sequences used to move structures vary from the
        !          1508: trivial (a multiple byte move) to the horrible (a
        !          1509: function call), and are very machine dependent.
        !          1510: .PP
        !          1511: The practical problem is more painful.
        !          1512: When a function returning a structure is called, this function
        !          1513: has to have some place to put the structure value.
        !          1514: If it places it on the stack, it has difficulty popping its stack frame.
        !          1515: If it places the value in a static temporary, the routine fails to be
        !          1516: reentrant.
        !          1517: The most logically consistent way of implementing this is for the
        !          1518: caller to pass in a pointer to a spot where the called function
        !          1519: should put the value before returning.
        !          1520: This is relatively straightforward, although a bit tedious, to implement,
        !          1521: but means that the caller must have properly declared
        !          1522: the function type, even if the value is never used.
        !          1523: On some machines, such as the Interdata 8/32, the return value
        !          1524: simply overlays the argument region (which on the 8/32 is part
        !          1525: of the caller's stack frame).
        !          1526: The caller takes care of leaving enough room if the returned value is larger
        !          1527: than the arguments.
        !          1528: This also assumes that the caller declares the
        !          1529: function properly.
        !          1530: .PP
        !          1531: The PDP-11 and the \*V have stack hardware which is used in function calls and returns;
        !          1532: this makes it very inconvenient to
        !          1533: use either of the above mechanisms.
        !          1534: In these machines, a static area within the called function
        !          1535: is allocated, and
        !          1536: the function return value is copied into it on return; the function
        !          1537: returns the address of that region.
        !          1538: This is simple to implement, but is non-reentrant.
        !          1539: However, the function can now be called as a subroutine
        !          1540: without being properly declared, without the disaster which would otherwise ensue.
        !          1541: No matter what choice is taken, the convention is that the function
        !          1542: actually returns the address of the return structure value.
        !          1543: .PP
        !          1544: In building expression trees, the portable compiler takes a bit for granted about
        !          1545: structures.
        !          1546: It assumes that functions returning structures
        !          1547: actually return a pointer to the structure, and it assumes that
        !          1548: a reference to a structure is actually a reference to its address.
        !          1549: The structure assignment operator is rebuilt so that the left
        !          1550: operand is the structure being assigned to, but the
        !          1551: right operand is the address of the structure being assigned;
        !          1552: this makes it easier to deal with
        !          1553: .DS
        !          1554: .I "a = b = c"
        !          1555: .DE
        !          1556: and similar constructions.
        !          1557: .PP
        !          1558: There are four special tree nodes associated with these
        !          1559: operations:
        !          1560: STASG (structure assignment), STARG (structure argument
        !          1561: to a function call), and STCALL and UNARY STCALL
        !          1562: (calls of a function with nonzero and zero arguments, respectively).
        !          1563: These four nodes are unique in that the size and alignment information, which can be determined by
        !          1564: the type for all other objects in C, 
        !          1565: must be known to carry out these operations; special
        !          1566: fields are set aside in these nodes to contain
        !          1567: this information, and special
        !          1568: intermediate code is used to transmit this information.
        !          1569: .SH
        !          1570: First Pass Summary
        !          1571: .PP
        !          1572: There are may other issues which have been ignored here,
        !          1573: partly to justify the title ``tour'', and partially
        !          1574: because they have seemed to cause little trouble.
        !          1575: There are some debugging flags
        !          1576: which may be turned on, by giving the compiler's first pass
        !          1577: the argument
        !          1578: .DS
        !          1579: .B \-X [flags]
        !          1580: .DE
        !          1581: Some of the more interesting flags are
        !          1582: \fB\-Xd\fP for the defining and freeing of symbols,
        !          1583: \fB\-Xi\fP for initialization comments, and
        !          1584: \fB\-Xb\fP for various comments about the building of trees.
        !          1585: In many cases, repeating the flag more than once gives more information;
        !          1586: thus,
        !          1587: \fB\-Xddd\fP gives more information than \fB\-Xd\fP.
        !          1588: In the two pass version of the compiler, the
        !          1589: flags should not be set when the output is sent to the second
        !          1590: pass, since the debugging output and the intermediate code both go onto the standard
        !          1591: output.
        !          1592: .PP
        !          1593: We turn now to consideration of the second pass.

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.