--- gcc/PROJECTS 2018/04/24 16:52:25 1.1.1.6 +++ gcc/PROJECTS 2018/04/24 16:56:20 1.1.1.8 @@ -19,9 +19,10 @@ to point to the constant and cause it to * More cse -The techniques for doing full global cse are described in the -red dragon book. It is likely to be slow and use a lot of memory, -but it might be worth offering as an additional option. +The techniques for doing full global cse are described in the red +dragon book, or (a different version) in Frederick Chow's thesis from +Stanford. It is likely to be slow and use a lot of memory, but it +might be worth offering as an additional option. It is probably possible to extend cse to a few very frequent cases without so much expense. @@ -48,10 +49,85 @@ You would need to verify in the calling use the addresses of any local variables and does not use setjmp. * Put short statics vars at low addresses and use short addressing mode? + Useful on the 68000/68020 and perhaps on the 32000 series, provided one has a linker that works with the feature. This is said to make a 15% speedup on the 68000. -This brings to mind Hayes' changes for Stanford MIPS. + +* Keep global variables in registers. + +Here is a scheme for doing this. A global variable, or a local variable +whose address is taken, can be kept in a register for an entire function +if it does not use non-constant memory addresses and (for globals only) +does not call other functions. If the entire function does not meet +this criterion, a loop may. + +The VAR_DECL for such a variable would have to have two RTL expressions: +the true home in memory, and the pseudo-register used temporarily. +It is necessary to emit insns to copy the memory location into the +pseudo-register at the beginning of the function or loop, and perhaps +back out at the end. These insns should have REG_EQUIV notes so that, +if the pseudo-register does not get a hard register, it is spilled into +the memory location which exists in any case. + +The easiest way to set up these insns is to modify the routine +put_var_into_stack so that it does not apply to the entire function +(sparing any loops which contain nothing dangerous) and to call it at +the end of the function regardless of where in the function the +address of a local variable is taken. It would be called +unconditionally at the end of the function for all relevant global +variables. + +For debugger output, the thing to do is to invent a new binding level +around the appropriate loop and define the variable name as a register +variable with that scope. + +* Live-range splitting. + +Currently a variable is allocated a hard register either for the full +extent of its use or not at all. Sometimes it would be good to +allocate a variable a hard register for just part of a function; for +example, through a particular loop where the variable is mostly used, +or outside of a particular loop where the variable is not used. (The +latter is nice because it might let the variable be in a register most +of the time even though the loop needs all the registers.) + +It might not be very hard to do this in global-alloc.c when a variable +fails to get a hard register for its entire life span. + +The first step is to find a loop in which the variable is live, but +which is not the whole life span or nearly so. It's probably best to +use a loop in which the variable is heavily used. + +Then create a new pseudo-register to represent the variable in that loop. +Substitute this for the old pseudo-register there, and insert move insns +to copy between the two at the loop entry and all exits. (When several +such moves are inserted at the same place, some new feature should be +added to say that none of those registers conflict merely because of +overlap between the new moves. And the reload pass should reorder them +so that a store precedes a load, for any given hard register.) + +After doing this for all the reasonable candidates, run global-alloc +over again. With luck, one of the two pseudo-registers will be fit +somewhere. It may even have a much higher priority due to its reduced +life span. + +There will be no room in general for the new pseudo-registers in +basic_block_live_at_start, so there will need to be a second such +matrix exclusively for the new ones. Various other vectors indexed by +register number will have to be made bigger, or there will have to be +secondary extender vectors just for global-alloc. + +A simple new feature could arrange that both pseudo-registers get the +same stack slot if they both fail to get hard registers. + +Other compilers split live ranges when they are not connected, or +try to split off pieces `at the edge'. I think splitting around loops +will provide more speedup. + +Creating a fake binding block and a new like-named variable with +shorter life span and different address might succeed in describing +this technique for the debugger. * Detect dead stores into memory? @@ -140,30 +216,6 @@ use each alternative. Some work has been done on this, in combine.c. -* Jump-execute-next. - -Many recent machines have jumps which optionally execute the following -instruction before the instruction jumped to, either conditionally or -unconditionally. To take advantage of this capability requires a new -compiler pass that would reorder instructions when possible. After -reload may be a good place for it. - -On some machines, the result of a load from memory is not available -until after the following instruction. The easiest way to support -these machines is to output each RTL load instruction as two assembler -instructions, the second being a no-op. Putting useful instructions -after the load instructions may be a similar task to putting them -after jump instructions. - -* Pipeline scheduling. - -On many machines, code gets faster if instructions are reordered -so that pipelines are kept full. Doing the best possible job of this -requires knowing which functional units each kind of instruction executes -in and how long the functional unit stays busy with it. Then the -goal is to reorder the instructions to keep many functional units -busy but never feed them so fast they must wait. - * Can optimize by changing if (x) y; else z; into z; if (x) y; if z and x do not interfere and z has no effects not undone by y. This is desirable if z is faster than jumping. @@ -176,13 +228,6 @@ d0 can be a junk register. The challeng a portable framework: when can you detect this situation and still be able to allocate a junk register? -* For the 80387 floating point, perhaps it would be possible to use 3 -or 4 registers in the stack to hold register variables. (It would be -necessary to keep track of how those slots move in the stack as other -pushes and pops are done.) This is probably very tricky, but if -you are a GCC wizard and you care about the speed of floating point on -an 80386, you might want to work on it. - 2. Simpler porting. Right now, describing the target machine's instructions is done @@ -212,7 +257,42 @@ desirable. Pascal, Modula-2 and Ada require the implementation of functions within functions. Some of the mechanisms for this already exist. -4. Generalize the machine model. +4. More extensions. + +* Label-addresses as expressions. + +It would be nice to have access to the addresses of labels; to be able to +store them in variables, or initialize vectors of them. + +Alas, `&label0' is the address of the variable named label0, which is +unrelated to the label with that name. Some other syntax is needed. +Perhaps colon as a unary operator? That is ambiguous with `?:' with +the middle operand omitted. Perhaps ^ as a unary operator? Perhaps +`__label__ label0' could mean the value of label0? Its type could be +`void *'. `goto *EXP' could be used to go to a value of type `void +*'--no ambiguity there. + +Jump optimization and flow analysis must know about computed jumps, +but that is not hard. Each basic block headed by a possible target of +computed jumps must be considered a successor of each basic block +ending in a computed jump. Aside from this, I believe no other +optimizer changes are needed. + +Next question: stack levels. In most functions, there is no problem, +but it would be a shame to make a feature that doesn't work together +with other features. Here is an idea: + +For each label that might need stack level restoration, construct a +shadow-label which will restore the stack and jump to the user-label. +Then use the address of the shadow label for label0 when someone asks +for that of label0. Jump optimization will delete all the shadow labels +if the function has no computed gotos. + +* Generated unique labels. Have some way of generating distinct labels +for use in extended asm statements. I don't know what a good syntax would +be. + +5. Generalize the machine model. * Some new compiler features may be needed to do a good job on machines where static data needs to be addressed using base registers. @@ -221,77 +301,6 @@ where static data needs to be addressed for scalars and another for large objects. The compiler does not now have a way to understand this. -5. Precompilation of header files. - -In the future, many programs will use thousands of lines of header files. -Compiling the headers might be slower than compiling the guts of any one -source file. Here is a scheme for precompiling header files to make -compilation faster for a sequence of headers which is often used. - -A precompiled header corresponds to a sequence of header files. The -preprocessor recognizes when the input starts with a sequence of -`#include' commands and searches a data base for a precompiled header -corresponding to that sequence. The modtimes of all these files are -stored in the data base so that one can tell whether the precompiled -header is still valid. - -For robustness, each directory should have its own collection of -precompiled headers and its own data base of them. Probably each -precompiled header would be a file and the data base would be one -more file. - -The data base records the entire collection of predefined macros and -their definitions, except for __FILE__, __LINE__ and __DATE__, for -each precompiled header. If this collection does not match the setup -at the start of the current compilation (including the results of -D -and -U switches), the precompiled header is inapplicable. It might -be possible to have distinct precompiled headers for the same sequence -of header files but different collections of predefined macros. - -The state of any option that affects macro processing, such as -ansi -or -traditional, must also be recorded, and the precompiled header is -valid only if these options match. - -The precompiled header contains an ordered series of strings. Some -strings are marked "unconditional"; these must be compiled each time -the precompiled header is used. Other strings are have keys, which -are identifiers. A string with keys must be compiled if at least one -of its keys is mentioned in the input. The order these strings appear -in the precompiled header is called their intrinsic order. - -The C preprocessor reads in the precompiled header file and scan all -the strings, making for each key an entry in the same symbol table -used for macros, pointing at a list of all the strings for which it is -a key. Each string must have a flag (one flag per string, not one per -key per string). The same code in `rescan' that detects references to -macros would detect a reference to a key and flag all of the strings -that it belongs to as needing to be output. - -Each of these strings is immediately recursively macroexpanded (i.e. -`rescan' is called), but the output from this is discarded. The -expansion is to detect any other keys mentioned in the string, and to -define any macros for which the string contains a #define. The key's -symbol table entry is be deleted to save time if the key is -encountered again, and to avoid an infinite recursion. - -The unconditional strings are macroexpanded with `rescan' (but the -output is discarded) at some time before anything is actually output. - -At the end of compilation, before any of the actual input text is -output, the list of strings is scanned in the intrinsic order, and -each string that is unconditional or flagged is output verbatim, -except that any #define lines are discarded. - -Precompiled headers would be constructed by explicit request with a -special tool. The first step is to run cpp on the sequence of header -files' contents. This would use a new option that would cause all -#define lines to be output unchanged as well as defining the macro. -The second step is to divide the output into strings, some keyed and -some unconditional. This division is done without changing the order -of the text being divided up. - -JNC@lcs.mit.edu has some ideas on this subject also. - 6. Better documentation of how GCC works and how to port it. Here is an outline proposed by Allan Adler.