|
|
1.1.1.6 root 1: 0. Improved efficiency.
2:
3: * Parse and output array initializers an element at a time, freeing
4: storage after each, instead of parsing the whole initializer first and
5: then outputting. This would reduce memory usage for large
6: initializers.
7:
1.1 root 8: 1. Better optimization.
9:
1.1.1.4 root 10: * Constants in unused inline functions
11:
12: It would be nice to delay output of string constants so that string
13: constants mentioned in unused inline functions are never generated.
14: Perhaps this would also take care of string constants in dead code.
15:
16: The difficulty is in finding a clean way for the RTL which refers
17: to the constant (currently, only by an assembler symbol name)
18: to point to the constant and cause it to be output.
19:
1.1 root 20: * More cse
21:
1.1.1.7 ! root 22: The techniques for doing full global cse are described in the red
! 23: dragon book, or (a different version) in Frederick Chow's thesis from
! 24: Stanford. It is likely to be slow and use a lot of memory, but it
! 25: might be worth offering as an additional option.
1.1 root 26:
27: It is probably possible to extend cse to a few very frequent cases
28: without so much expense.
29:
30: For example, it is not very hard to handle cse through if-then
31: statements with no else clauses. Here's how to do it. On reaching a
32: label, notice that the label's use-count is 1 and that the last
33: preceding jump jumps conditionally to this label. Now you know it
34: is a simple if-then statement. Remove from the hash table
35: all the expressions that were entered since that jump insn
36: and you can continue with cse.
37:
38: It is probably not hard to handle cse from the end of a loop
39: around to the beginning, and a few loops would be greatly sped
40: up by this.
41:
1.1.1.3 root 42: * Support more general tail-recursion among different functions.
1.1 root 43:
1.1.1.3 root 44: This might be possible under certain circumstances, such as when
45: the argument lists of the functions have the same lengths.
46: Perhaps it could be done with a special declaration.
47:
48: You would need to verify in the calling function that it does not
49: use the addresses of any local variables and does not use setjmp.
50:
51: * Put short statics vars at low addresses and use short addressing mode?
1.1.1.7 ! root 52:
1.1.1.3 root 53: Useful on the 68000/68020 and perhaps on the 32000 series,
54: provided one has a linker that works with the feature.
55: This is said to make a 15% speedup on the 68000.
1.1.1.7 ! root 56:
! 57: * Keep global variables in registers.
! 58:
! 59: Here is a scheme for doing this. A global variable, or a local variable
! 60: whose address is taken, can be kept in a register for an entire function
! 61: if it does not use non-constant memory addresses and (for globals only)
! 62: does not call other functions. If the entire function does not meet
! 63: this criterion, a loop may.
! 64:
! 65: The VAR_DECL for such a variable would have to have two RTL expressions:
! 66: the true home in memory, and the pseudo-register used temporarily.
! 67: It is necessary to emit insns to copy the memory location into the
! 68: pseudo-register at the beginning of the function or loop, and perhaps
! 69: back out at the end. These insns should have REG_EQUIV notes so that,
! 70: if the pseudo-register does not get a hard register, it is spilled into
! 71: the memory location which exists in any case.
! 72:
! 73: The easiest way to set up these insns is to modify the routine
! 74: put_var_into_stack so that it does not apply to the entire function
! 75: (sparing any loops which contain nothing dangerous) and to call it at
! 76: the end of the function regardless of where in the function the
! 77: address of a local variable is taken. It would be called
! 78: unconditionally at the end of the function for all relevant global
! 79: variables.
! 80:
! 81: For debugger output, the thing to do is to invent a new binding level
! 82: around the appropriate loop and define the variable name as a register
! 83: variable with that scope.
! 84:
! 85: * Live-range splitting.
! 86:
! 87: Currently a variable is allocated a hard register either for the full
! 88: extent of its use or not at all. Sometimes it would be good to
! 89: allocate a variable a hard register for just part of a function; for
! 90: example, through a particular loop where the variable is mostly used,
! 91: or outside of a particular loop where the variable is not used. (The
! 92: latter is nice because it might let the variable be in a register most
! 93: of the time even though the loop needs all the registers.)
! 94:
! 95: It might not be very hard to do this in global-alloc.c when a variable
! 96: fails to get a hard register for its entire life span.
! 97:
! 98: The first step is to find a loop in which the variable is live, but
! 99: which is not the whole life span or nearly so. It's probably best to
! 100: use a loop in which the variable is heavily used.
! 101:
! 102: Then create a new pseudo-register to represent the variable in that loop.
! 103: Substitute this for the old pseudo-register there, and insert move insns
! 104: to copy between the two at the loop entry and all exits. (When several
! 105: such moves are inserted at the same place, some new feature should be
! 106: added to say that none of those registers conflict merely because of
! 107: overlap between the new moves. And the reload pass should reorder them
! 108: so that a store precedes a load, for any given hard register.)
! 109:
! 110: After doing this for all the reasonable candidates, run global-alloc
! 111: over again. With luck, one of the two pseudo-registers will be fit
! 112: somewhere. It may even have a much higher priority due to its reduced
! 113: life span.
! 114:
! 115: There will be no room in general for the new pseudo-registers in
! 116: basic_block_live_at_start, so there will need to be a second such
! 117: matrix exclusively for the new ones. Various other vectors indexed by
! 118: register number will have to be made bigger, or there will have to be
! 119: secondary extender vectors just for global-alloc.
! 120:
! 121: A simple new feature could arrange that both pseudo-registers get the
! 122: same stack slot if they both fail to get hard registers.
! 123:
! 124: Other compilers split live ranges when they are not connected, or
! 125: try to split off pieces `at the edge'. I think splitting around loops
! 126: will provide more speedup.
! 127:
! 128: Creating a fake binding block and a new like-named variable with
! 129: shorter life span and different address might succeed in describing
! 130: this technique for the debugger.
1.1.1.3 root 131:
132: * Detect dead stores into memory?
133:
134: A store into memory is dead if it is followed by another store into
1.1.1.5 root 135: the same location; and, in between, there is no reference to anything
136: that might be that location (including no reference to a variable
137: address).
1.1.1.3 root 138:
139: * Loop optimization.
140:
141: Strength reduction and iteration variable elimination could be
142: smarter. They should know how to decide which iteration variables are
143: not worth making explicit because they can be computed as part of an
144: address calculation. Based on this information, they should decide
1.1.1.5 root 145: when it is desirable to eliminate one iteration variable and create
1.1.1.3 root 146: another in its place.
147:
148: It should be possible to compute what the value of an iteration
149: variable will be at the end of the loop, and eliminate the variable
150: within the loop by computing that value at the loop end.
151:
152: When a loop has a simple increment that adds 1,
153: instead of jumping in after the increment,
154: decrement the loop count and jump to the increment.
155: This allows aob insns to be used.
1.1 root 156:
1.1.1.2 root 157: * Using constraints on values.
1.1 root 158:
1.1.1.2 root 159: Many operations could be simplified based on knowledge of the
160: minimum and maximum possible values of a register at any particular time.
161: These limits could come from the data types in the tree, via rtl generation,
162: or they can be deduced from operations that are performed. For example,
163: the result of an `and' operation one of whose operands is 7 must be in
164: the range 0 to 7. Compare instructions also tell something about the
165: possible values of the operand, in the code beyond the test.
166:
167: Value constraints can be used to determine the results of a further
168: comparison. They can also indicate that certain `and' operations are
169: redundant. Constraints might permit a decrement and branch
170: instruction that checks zeroness to be used when the user has
171: specified to exit if negative.
1.1 root 172:
173: * Smarter reload pass.
174:
175: The reload pass as currently written can reload values only into registers
176: that are reserved for reloading. This means that in order to use a
177: register for reloading it must spill everything out of that register.
178:
179: It would be straightforward, though complicated, for reload1.c to keep
180: track, during its scan, of which hard registers were available at each
181: point in the function, and use for reloading even registers that were
182: free only at the point they were needed. This would avoid much spilling
183: and make better code.
184:
185: * Change the type of a variable.
186:
187: Sometimes a variable is declared as `int', it is assigned only once
188: from a value of type `char', and then it is used only by comparison
189: against constants. On many machines, better code would result if
190: the variable had type `char'. If the compiler could detect this
191: case, it could change the declaration of the variable and change
192: all the places that use it.
193:
194: * Order of subexpressions.
195:
196: It might be possible to make better code by paying attention
197: to the order in which to generate code for subexpressions of an expression.
198:
1.1.1.3 root 199: * More code motion.
1.1 root 200:
1.1.1.3 root 201: Consider hoisting common code up past conditional branches or
202: tablejumps.
203:
204: * Trace scheduling.
205:
206: This technique is said to be able to figure out which way a jump
207: will usually go, and rearrange the code to make that path the
208: faster one.
1.1 root 209:
210: * Distributive law.
211:
1.1.1.2 root 212: The C expression *(X + 4 * (Y + C)) compiles better on certain
213: machines if rewritten as *(X + 4*C + 4*Y) because of known addressing
214: modes. It may be tricky to determine when, and for which machines, to
215: use each alternative.
216:
1.1.1.3 root 217: Some work has been done on this, in combine.c.
218:
1.1.1.2 root 219: * Jump-execute-next.
220:
221: Many recent machines have jumps which optionally execute the following
222: instruction before the instruction jumped to, either conditionally or
223: unconditionally. To take advantage of this capability requires a new
224: compiler pass that would reorder instructions when possible. After
225: reload may be a good place for it.
226:
227: On some machines, the result of a load from memory is not available
228: until after the following instruction. The easiest way to support
229: these machines is to output each RTL load instruction as two assembler
230: instructions, the second being a no-op. Putting useful instructions
231: after the load instructions may be a similar task to putting them
232: after jump instructions.
233:
234: * Pipeline scheduling.
235:
236: On many machines, code gets faster if instructions are reordered
237: so that pipelines are kept full. Doing the best possible job of this
238: requires knowing which functional units each kind of instruction executes
239: in and how long the functional unit stays busy with it. Then the
240: goal is to reorder the instructions to keep many functional units
241: busy but never feed them so fast they must wait.
1.1 root 242:
1.1.1.3 root 243: * Can optimize by changing if (x) y; else z; into z; if (x) y;
244: if z and x do not interfere and z has no effects not undone by y.
245: This is desirable if z is faster than jumping.
246:
247: * For a two-insn loop on the 68020, such as
248: foo: movb a2@+,a3@+
249: jne foo
250: it is better to insert dbeq d0,foo before the jne.
251: d0 can be a junk register. The challenge is to fit this into
252: a portable framework: when can you detect this situation and
253: still be able to allocate a junk register?
254:
1.1.1.6 root 255: * For the 80387 floating point, perhaps it would be possible to use 3
256: or 4 registers in the stack to hold register variables. (It would be
257: necessary to keep track of how those slots move in the stack as other
258: pushes and pops are done.) This is probably very tricky, but if
259: you are a GCC wizard and you care about the speed of floating point on
260: an 80386, you might want to work on it.
261:
1.1 root 262: 2. Simpler porting.
263:
264: Right now, describing the target machine's instructions is done
265: cleanly, but describing its addressing mode is done with several
266: ad-hoc macro definitions. Porting would be much easier if there were
267: an RTL description for addressing modes like that for instructions.
268: Tools analogous to genflags and genrecog would generate macros from
269: this description.
270:
271: There would be one pattern in the address-description file for each
272: kind of addressing, and this pattern would have:
273:
274: * the RTL expression for the address
275: * C code to verify its validity (since that may depend on
276: the exact data).
277: * C code to print the address in assembler language.
278: * C code to convert the address into a valid one, if it is not valid.
279: (This would replace LEGITIMIZE_ADDRESS).
280: * Register constraints for all indeterminates that appear
281: in the RTL expression.
282:
283: 3. Other languages.
284:
1.1.1.2 root 285: Front ends for Pascal, Fortran, Algol, Cobol, Modula-2 and Ada are
286: desirable.
1.1 root 287:
1.1.1.2 root 288: Pascal, Modula-2 and Ada require the implementation of functions
289: within functions. Some of the mechanisms for this already exist.
1.1 root 290:
1.1.1.7 ! root 291: 4. More extensions.
! 292:
! 293: * Label-addresses as expressions.
! 294:
! 295: It would be nice to have access to the addresses of labels; to be able to
! 296: store them in variables, or initialize vectors of them.
! 297:
! 298: Alas, `&label0' is the address of the variable named label0, which is
! 299: unrelated to the label with that name. Some other syntax is needed.
! 300: Perhaps colon as a unary operator? That is ambiguous with `?:' with
! 301: the middle operand omitted. Perhaps ^ as a unary operator? Perhaps
! 302: `__label__ label0' could mean the value of label0? Its type could be
! 303: `void *'. `goto *EXP' could be used to go to a value of type `void
! 304: *'--no ambiguity there.
! 305:
! 306: Jump optimization and flow analysis must know about computed jumps,
! 307: but that is not hard. Each basic block headed by a possible target of
! 308: computed jumps must be considered a successor of each basic block
! 309: ending in a computed jump. Aside from this, I believe no other
! 310: optimizer changes are needed.
! 311:
! 312: Next question: stack levels. In most functions, there is no problem,
! 313: but it would be a shame to make a feature that doesn't work together
! 314: with other features. Here is an idea:
! 315:
! 316: For each label that might need stack level restoration, construct a
! 317: shadow-label which will restore the stack and jump to the user-label.
! 318: Then use the address of the shadow label for label0 when someone asks
! 319: for that of label0. Jump optimization will delete all the shadow labels
! 320: if the function has no computed gotos.
! 321:
! 322: * Block structure for labels.
! 323:
! 324: The ({...}) construct should serve as a lexical block for label names,
! 325: so that the same label may be defined both inside and outside of it.
! 326: Then macro definitions could use labels internally safely, by enclosing
! 327: the label and the goto in one of these constructs.
! 328:
! 329: * Generated unique labels. Have some way of generating distinct labels
! 330: for use in extended asm statements. I don't know what a good syntax would
! 331: be.
! 332:
! 333: 5. Generalize the machine model.
1.1 root 334:
1.1.1.4 root 335: * Some new compiler features may be needed to do a good job on machines
1.1.1.2 root 336: where static data needs to be addressed using base registers.
1.1 root 337:
1.1.1.4 root 338: * Some machines have two stacks in different areas of memory, one used
1.1.1.2 root 339: for scalars and another for large objects. The compiler does not
340: now have a way to understand this.
1.1.1.3 root 341:
1.1.1.7 ! root 342: 6. Precompilation of header files.
1.1.1.3 root 343:
344: In the future, many programs will use thousands of lines of header files.
345: Compiling the headers might be slower than compiling the guts of any one
346: source file. Here is a scheme for precompiling header files to make
347: compilation faster for a sequence of headers which is often used.
348:
349: A precompiled header corresponds to a sequence of header files. The
350: preprocessor recognizes when the input starts with a sequence of
351: `#include' commands and searches a data base for a precompiled header
352: corresponding to that sequence. The modtimes of all these files are
353: stored in the data base so that one can tell whether the precompiled
354: header is still valid.
355:
356: For robustness, each directory should have its own collection of
357: precompiled headers and its own data base of them. Probably each
358: precompiled header would be a file and the data base would be one
359: more file.
360:
361: The data base records the entire collection of predefined macros and
362: their definitions, except for __FILE__, __LINE__ and __DATE__, for
363: each precompiled header. If this collection does not match the setup
364: at the start of the current compilation (including the results of -D
365: and -U switches), the precompiled header is inapplicable. It might
366: be possible to have distinct precompiled headers for the same sequence
367: of header files but different collections of predefined macros.
368:
369: The state of any option that affects macro processing, such as -ansi
370: or -traditional, must also be recorded, and the precompiled header is
371: valid only if these options match.
372:
373: The precompiled header contains an ordered series of strings. Some
374: strings are marked "unconditional"; these must be compiled each time
375: the precompiled header is used. Other strings are have keys, which
376: are identifiers. A string with keys must be compiled if at least one
377: of its keys is mentioned in the input. The order these strings appear
378: in the precompiled header is called their intrinsic order.
379:
380: The C preprocessor reads in the precompiled header file and scan all
381: the strings, making for each key an entry in the same symbol table
382: used for macros, pointing at a list of all the strings for which it is
383: a key. Each string must have a flag (one flag per string, not one per
384: key per string). The same code in `rescan' that detects references to
385: macros would detect a reference to a key and flag all of the strings
386: that it belongs to as needing to be output.
387:
388: Each of these strings is immediately recursively macroexpanded (i.e.
389: `rescan' is called), but the output from this is discarded. The
390: expansion is to detect any other keys mentioned in the string, and to
391: define any macros for which the string contains a #define. The key's
392: symbol table entry is be deleted to save time if the key is
393: encountered again, and to avoid an infinite recursion.
394:
395: The unconditional strings are macroexpanded with `rescan' (but the
396: output is discarded) at some time before anything is actually output.
397:
398: At the end of compilation, before any of the actual input text is
399: output, the list of strings is scanned in the intrinsic order, and
400: each string that is unconditional or flagged is output verbatim,
401: except that any #define lines are discarded.
402:
403: Precompiled headers would be constructed by explicit request with a
404: special tool. The first step is to run cpp on the sequence of header
405: files' contents. This would use a new option that would cause all
406: #define lines to be output unchanged as well as defining the macro.
407: The second step is to divide the output into strings, some keyed and
408: some unconditional. This division is done without changing the order
409: of the text being divided up.
1.1.1.4 root 410:
1.1.1.5 root 411: [email protected] has some ideas on this subject also.
412:
1.1.1.7 ! root 413: 7. Better documentation of how GCC works and how to port it.
1.1.1.4 root 414:
415: Here is an outline proposed by Allan Adler.
416:
417: I. Overview of this document
418: II. The machines on which GCC is implemented
419: A. Prose description of those characteristics of target machines and
420: their operating systems which are pertinent to the implementation
421: of GCC.
422: i. target machine characteristics
423: ii. comparison of this system of machine characteristics with
424: other systems of machine specification currently in use
425: B. Tables of the characteristics of the target machines on which
426: GCC is implemented.
427: C. A priori restrictions on the values of characteristics of target
428: machines, with special reference to those parts of the source code
429: which entail those restrictions
430: i. restrictions on individual characteristics
431: ii. restrictions involving relations between various characteristics
432: D. The use of GCC as a cross-compiler
433: i. cross-compilation to existing machines
434: ii. cross-compilation to non-existent machines
435: E. Assumptions which are made regarding the target machine
436: i. assumptions regarding the architecture of the target machine
437: ii. assumptions regarding the operating system of the target machine
438: iii. assumptions regarding software resident on the target machine
439: iv. where in the source code these assumptions are in effect made
440: III. A systematic approach to writing the files tm.h and xm.h
441: A. Macros which require special care or skill
442: B. Examples, with special reference to the underlying reasoning
443: IV. A systematic approach to writing the machine description file md
444: A. Minimal viable sets of insn descriptions
445: B. Examples, with special reference to the underlying reasoning
446: V. Uses of the file aux-output.c
447: VI. Specification of what constitutes correct performance of an
448: implementation of GCC
449: A. The components of GCC
450: B. The itinerary of a C program through GCC
451: C. A system of benchmark programs
1.1.1.5 root 452: D. What your RTL and assembler should look like with these benchmarks
1.1.1.4 root 453: E. Fine tuning for speed and size of compiled code
454: VII. A systematic procedure for debugging an implementation of GCC
455: A. Use of GDB
456: i. the macros in the file .gdbinit for GCC
457: ii. obstacles to the use of GDB
458: a. functions implemented as macros can't be called in GDB
459: B. Debugging without GDB
460: i. How to turn off the normal operation of GCC and access specific
461: parts of GCC
462: C. Debugging tools
463: D. Debugging the parser
464: i. how machine macros and insn definitions affect the parser
465: E. Debugging the recognizer
466: i. how machine macros and insn definitions affect the recognizer
467:
468: ditto for other components
469:
470: VIII. Data types used by GCC, with special reference to restrictions not
471: specified in the formal definition of the data type
472: IX. References to the literature for the algorithms used in GCC
473:
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.