Jon Bentley, a distinguished computer scientist at AT&T Bell Labs and the author of many well-known publications, wrote Writing Efficient Programs. In this out-of-print classic, Bentley provides a unified, pragmatic treatment of program efficiency, independent of language and host platform.
For ease of presentation, Bentley codified his methods as a set of terse rules. The following is the list of Bentley's rules, edited from Appendix C of Writing Efficient Programs and adapted to suit the special needs and opportunities of the Origin 2000 architecture, SGI version 7.x compilers, and the MIPS instruction set.
The time required for common operations on data can often be reduced
by augmenting the structure with additional information, or by changing
the information within the structure so it can be accessed more
easily.
Examples:
For the Origin 2000, with a 100:1 ratio between instruction time and memory-fetch time, it is more important to minimize secondary cache misses than to minimize instruction execution. Special applications of this rule include:
In one example cited in a paper given at Supercomputing '96, most
of the cache misses of one program were caused by "a search
loop that accesses a four-byte
integer field in each element of an array of structures...By
storing the integer items in a separate, shadow array, the
secondary cache miss rate of the entire application is
reduced."
The cost of recomputing an expensive function can be reduced by computing the function only once and storing the results. Subsequent requests for the function are handled by table lookup.
Precomputation is almost certainly a win when the replaced computation involves disk access, or an IRIX system call, or even a substantial libc function. Before precomputing an ordinary calculation, consider that if retrieval of the precomputed result can lead to an additional cache miss, the replaced computation must cost at least 100 instructions to break even.
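As a minimal sketch of this rule (the function name weight(), its formula, and the 256-entry domain are illustrative assumptions, not taken from Bentley), an expensive computation over a small integer domain can be done once at startup and replaced thereafter by an indexed load:

#include <math.h>

static double weight_table[256];

void init_weights(void)
{
    int i;
    for (i = 0; i < 256; ++i)
        weight_table[i] = exp(-0.5 * i);   /* expensive computation done once */
}

/* later requests for the function become a table lookup */
#define WEIGHT(i) (weight_table[(i)])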
Data that is accessed most often should be the cheapest to access.
However, caching can backfire when locality is not in fact
present.
Examples:
The MIPS R10000, Origin 2000 node, and IRIX kernel already cache at a number of levels:
However, all these standard caches are shared by multiple processes (from the standpoint of your program, the other processes are "diluting" the cache with their data, driving out your data). Also, system cache management knows nothing about your algorithms. Application-dependent caching can yield huge payoffs, especially when it avoids disk access.
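As an illustration of application-dependent caching, the following sketch remembers the single most recent lookup; the names lookup_record() and fetch_from_disk() are hypothetical, and a real cache would usually keep more than one entry:

#include <string.h>

#define KEYLEN 64
static char   last_key[KEYLEN];
static double last_value;
static int    cache_valid = 0;

extern double fetch_from_disk(const char *key);   /* the expensive operation */

double lookup_record(const char *key)
{
    if (cache_valid && strncmp(key, last_key, KEYLEN) == 0)
        return last_value;                 /* hit: no disk access */
    last_value = fetch_from_disk(key);     /* miss: do the expensive work */
    strncpy(last_key, key, KEYLEN - 1);
    last_key[KEYLEN - 1] = '\0';
    cache_valid = 1;
    return last_value;
}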
Never evaluate an item until it is needed.
Examples:
Dense storage representations can decrease storage costs by
increasing the time to store and retrieve data.
Typical examples
include packing multiple binary integers into 32- or 64-bit words.
The memory complement of a typical Origin 2000 server is large enough that you might think such bit-squeezing techniques should be relics of the past. However, is file access time a significant factor in your program's speed? (Use prof(1) to find out.) If the file format can be squeezed 50%, sequential access time is cut by that much, and direct access time is reduced due to shorter seek distances. Also, if a key dataset does not fit into cache, and by packing a data structure you can get all of it into cache, the extra instruction cycles to unpack each item are repaid many times over.
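A minimal sketch of the packing idea, assuming C99 fixed-width types: four 16-bit fields are squeezed into one 64-bit word, at the cost of a few shift and mask instructions to unpack each field.

#include <stdint.h>

typedef uint64_t packed4_t;

packed4_t pack4(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    return ((packed4_t)a << 48) | ((packed4_t)b << 32) |
           ((packed4_t)c << 16) |  (packed4_t)d;
}

uint16_t unpack_field(packed4_t w, int n)   /* n = 0..3; field 0 is the highest */
{
    return (uint16_t)(w >> (48 - 16 * n));
}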
The space required to represent a program can often be decreased by
the use of interpreters in which common sequences of operations are
represented compactly.
A typical example is the use of a
finite-state machine to encode a complex protocol or lexical format
into a small table.
Finite-state machines, decision tables, and token interpreters can all yield elegant, maintainable algorithm designs. However, keep in mind that the coding techniques you use to implement an interpreter, in particular computed gotos and vector tables, tend to defeat the compiler's code manipulations. Don't expect the compiler to optimize or parallelize the interpreter's central loop.
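The following sketch shows the flavor of a table-driven finite-state machine; the four-state recognizer for C-style /* */ comments is purely illustrative:

enum state { CODE, SLASH, COMMENT, STAR };   /* four states */

static const unsigned char next_state[4][3] = {
    /*            other     '/'      '*'     */
    /* CODE    */ { CODE,    SLASH,   CODE    },
    /* SLASH   */ { CODE,    SLASH,   COMMENT },
    /* COMMENT */ { COMMENT, COMMENT, STAR    },
    /* STAR    */ { COMMENT, CODE,    STAR    },
};

static int char_class(int c) { return (c == '/') ? 1 : (c == '*') ? 2 : 0; }

int advance(int state, int c)     /* consume one character, return the new state */
{
    return next_state[state][char_class(c)];
}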
Many of these loop-transformation rules have been incorporated into the SGI 7.x compilers. This makes it easier to apply them in controlled ways using compiler options.
An invariant expression (one whose value does not depend on the loop
variable) should be calculated once, outside the loop, rather than
iteratively.
But keep in mind:
for (i=0;i<Z;++i) { if (x[i]<(fact/div))...; }
At nonzero optimization levels the compiler will recognize
fact/div
as invariant, and move it to a generated temporary in the loop
prologue. You should leave the code in this naive but readable
form, and not clutter it with yet another scratch
variable.
for (i=0;i<Z;++i) { if (x[i]<func(fact,div))...; }
When func(fact,div) does not depend on i, you should move the call out of the loop:
for (i=0,t=func(fact,div);i<Z;++i) { if (x[i]<t)...; }
An exception is when you request function inlining. After the compiler inlines the function body, it may then recognize that some or all of the inlined code is invariant, and move those parts out of the loop.
An efficient inner loop should contain as few tests as possible, and
preferably only one. Try to combine exit conditions.
Examples:
This optimization can be especially helpful with the R10000. The R10000 can pipeline a single conditional branch, executing speculatively along the most likely arm. However, two conditional branches in quick succession can interfere with the pipeline and cause the CPU to skip cycles until the branch condition values are available.
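One standard illustration of combining tests (a sketch, not drawn from the text above) is the sentinel search, in which a copy of the key stored after the last element lets the loop make do with a single test per element:

int find(int *x, int n, int key)    /* caller must provide room for x[n] */
{
    int i = 0;
    x[n] = key;                     /* sentinel merges the "found" and "end of array" tests */
    while (x[i] != key)
        ++i;
    return (i < n) ? i : -1;        /* -1 means not found */
}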
A large cost of some short loops is in modifying loop indexes. That cost can often be reduced by unrolling the loop.
This optimization can bring large benefits, especially in a superscalar CPU like the R10000. However, loop unrolling is a touchy, error-prone operation to carry out by hand, and the resulting code is hard to read and to maintain. Fortunately, the 7.x compilers have extensive support for loop unrolling. You can control the amount of unrolling either with a compiler option or loop-by-loop with directives. The compiler takes care of the niggling details and the bookkeeping involved in handling end-of-loop, while the source code remains readable.
A benefit of compiler-controlled loop unrolling is that the compiler applies the optimizations of common subexpression elimination, constant propagation, code motion out of loops, and function call inlining both before and after unrolling the loop. The optimizations work together for a synergistic effect. The software-pipelining optimization (rearranging the order of the generated machine instructions in order to maximize use of the R10000 execution units) also becomes more effective when it has a longer loop body to work on.
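For illustration only, this is what unrolling by four looks like when done by hand; with the 7.x compilers you would normally leave the simple loop alone and let the optimizer (or a per-loop unrolling directive) produce the equivalent:

void scale(double *x, double a, int n)
{
    int i;
    for (i = 0; i + 3 < n; i += 4) {   /* main unrolled body */
        x[i]   *= a;
        x[i+1] *= a;
        x[i+2] *= a;
        x[i+3] *= a;
    }
    for (; i < n; ++i)                 /* end-of-loop cleanup the compiler */
        x[i] *= a;                     /* would otherwise generate for you */
}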
When a large cost of an inner loop is devoted to trivial assignments (other than maintenance of loop indices), the assignments can be reduced by duplicating the loop body and renaming the variables in the second copy.
As with simple loop unrolling, this important optimization is well-handled by the compiler. Leave your code simple and readable and let the compiler do this.
A fast loop should have no unconditional branches. An unconditional branch at the end of a loop can be handled by "rotating" the loop to put a conditional branch at the bottom.
This optimization might be needed in assembly-language code, but no modern compiler should generate code for a loop with the test at the top and an unconditional branch at the bottom.
If two nearby loops operate on the same set of elements, combine
their operational parts and use only one set of loop-control
operations.
Example: convert
for (i=0;i<n;i++) for (j=0;j<n;j++) b[i][j] = a[i][j];
for (i=0;i<n;i++) for (j=0;j<n;j++) c[i][j] = b[i][j];
into the single loop
for (i=0;i<n;i++) for (j=0;j<n;j++) { b[i][j] = a[i][j]; c[i][j] = b[i][j]; }
This technique can produce important benefits in the Origin 2000 because, first, the contents of array b are only passed through the cache once, and second, the instructions to load each element of b (from the second loop) are eliminated. However, it is not necessary for you to perform this optimization manually. The 7.x compilers perform loop fusion automatically at certain optimization levels. The compilers can also perform the following optimizations:
Each of these changes, if done manually, complicates the source code with many temporary variables and insanely complicated loop control. In general, write loop code in the simplest, most portable and maintainable fashion, and then let the compiler tangle it behind the scenes.
In a conditional expression, replace a costly expression with an
algebraically equivalent expression that is cheaper to evaluate.
Examples:
Any such replacements that require insight are worth doing. Do not spend time replacing simple manifest expressions with constants; the compiler is good at that, and good at recognizing common subexpressions.
When a monotone function of several variables is to be tested for a threshold value, break off the test as soon as the result is known.
Short-circuit evaluation of a Boolean expression is a well-known case (handled automatically by the compiler). Bentley gives a more interesting example. Given a function to find the point in a list that is nearest to a given point:
#include <math.h>
#define sqr(x) ((x)*(x))   /* squared-difference helper; assumed, not shown in the original */
typedef struct point_s { double X, Y; } point_t;

int nearest_XY(int n, point_t given, point_t points[])
{
    int j, close_n = -1;
    double thisDist, closeDist = MAXFLOAT;
    for (j=0; j<n; ++j) {
        thisDist = sqr(points[j].X-given.X) + sqr(points[j].Y-given.Y);
        if (thisDist < closeDist) {
            closeDist = thisDist;
            close_n = j;
        }
    }
    return close_n;
}
This function as it stands short-circuits the distance computation (for purposes of comparing two distances, it is sufficient to compare the sum of squares and skip doing the square root). However, in many cases the X-distance alone is enough to eliminate a point, which avoids the need for the second multiply and an addition.
int nearest_XZ(int n, point_t given, point_t points[])
{
    int j, close_n = -1;
    double thisDist, closeDist = MAXFLOAT;
    for (j=0; j<n; ++j) {
        thisDist = sqr(points[j].X-given.X);
        if (thisDist < closeDist) /* possible */ {
            thisDist += sqr(points[j].Y-given.Y);
            if (thisDist < closeDist) {
                closeDist = thisDist;
                close_n = j;
            }
        }
    }
    return close_n;
}
Logical tests should be arranged such that cheap tests precede expensive ones, and so that ones most likely to succeed precede ones more likely to fail.
In the absence of other information, the SGI compilers assume that the then part of an if is more likely than the else, so it is good practice to put the most likely code in the then part.
It is possible to run the program under speedshop (experiment type ideal), producing a file of exact execution counts. This file can be filtered through prof (format -feedback) and the resulting file can be fed back into the next compilation. The compiler will take branch execution counts from the feedback file, and create an instruction schedule that reflects the exact probabilities of each branch.
This is a convenient feature that can have good results on a poorly-written program. However, the compiler does not take into account the cost of a test that involves a function call, or a test that can touch an unused cache line or page. Your insight is still needed to order tests so as to avoid expensive operations when possible.
A logical function over a small, finite domain can be replaced by a
table lookup.
Examples include using a table to classify
characters in lexical classes, and implementing an interpreter
using a branch table indexed by instruction code.
When the table is small and frequently-used, this technique is almost always helpful. However, when the table is sparse, or much larger than the executable code that it replaces, the effect of the table on the cache could swamp the benefits.
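A minimal sketch of the technique: a 256-entry table classifies bytes as identifier characters with a single load, replacing a chain of comparisons. The table and macro names are illustrative.

static unsigned char is_ident[256];

void init_ident_table(void)
{
    int c;
    for (c = 'a'; c <= 'z'; ++c) is_ident[c] = 1;
    for (c = 'A'; c <= 'Z'; ++c) is_ident[c] = 1;
    for (c = '0'; c <= '9'; ++c) is_ident[c] = 1;
    is_ident['_'] = 1;
}

#define IS_IDENT(c) (is_ident[(unsigned char)(c)])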
Assignment to a Boolean variable and its later test can be replaced by an if statement on the Boolean value. Generalizing, assignment to any control variable over a small, finite domain can be replaced by a case statement on the value that would have been assigned.
Application of this rule saves three things: declaration of a variable to hold a control value; assignment into the variable; and fetching from the variable when the value is tested. The rule says that instead, you should modify the code so the control value is used to direct execution as soon as the value is calculated. In some cases you have to duplicate code in order to apply the rule.
The 7.x compilers are good at reusing the stack space of local variables after their last use, and also good at saving calculated values in machine registers between their assignment and a nearby use. When the sole use of a local variable is to hold a value between its expression and a following test, the compiler (at -O2 and higher) is quite capable of eliminating the variable and using only a register temp.
Hence the only time you should distort your program's logic to apply this rule is when the assignment is separated from its use by many statements, or by a non-inlined procedure call. (A procedure call usually makes the compiler spill register values to memory.)
The run time of a set of procedures that (nonrecursively) call themselves can often be reduced by rewriting the procedures inline and binding what were passed arguments.
This technique is extremely useful, but difficult to apply to an existing program and still leave the program readable and maintainable. Fortunately, the 7.x compilers contain extensive support for interprocedural analysis (IPA) and automatic inlining. Given that, it is best to write the program in a plain, naive way, not fretting about frequent calls to small procedures, and to let the compiler inline the small procedures.
Although inlining eliminates the instruction overhead of a function call, it inflates the size of the object code. At a certain level of inlining, the effect of the inflated, low-density code on cache and virtual memory wastes as much time as it saves.
Keep in mind the maintenance implications of inlining procedures in code you will distribute to other users. A library procedure that has been inlined even once can never be field-replaced. You can send out a revised version of the library, but programs that inlined the function will never call the library version.
A procedure should be organized to handle the commonest cases most efficiently, and all remaining cases correctly (if more slowly). When designing the application, attempt to arrange the input conditions so that the efficient cases are the commonest.
The Caching rule can be seen as an extension of this rule, in which "the commonest case" is "whatever case was most recently requested." Bentley cites a Julian-date conversion function in which it was observed that 90% of all calls presented the same date as the previous call. Saving the last result, and returning it without recomputing when appropriate, saved time.
One way to apply the rule is to have a function test for the common cases and return their results quickly, falling through to the more involved code of the hard cases. However, such a function is likely to be a bulky one to inline. Instead, write two functions. The public function is short, and efficiently handles the most common cases. When it is called with a case it cannot handle, it calls the longer, private function that deals with hard cases.
The point of this design is that the public function is short enough that you can let the compiler inline it automatically. The common cases will be handled by the inlined code. The compiler should not inline the less-used, longer function.
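A minimal sketch of this two-function design, with hypothetical names convert() and convert_hard_cases():

extern double convert_hard_cases(int code, double value);   /* long, rarely called */

double convert(int code, double value)
{
    if (code == 0)                  /* the overwhelmingly common case, */
        return value;               /* handled in a few inlined instructions */
    return convert_hard_cases(code, value);
}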
A multiphase algorithm in which the phases are linked by temporary files (or arrays) can be reduced to a single-pass algorithm using coroutines.
Coroutines are defined and described in detail in Knuth (Volume I) and most other modern books on algorithmics. Under IRIX, you can write coroutines in a natural way using any one of three models:
A simple, readable solution expressed as a recursion can be
transformed into a more efficient iterative solution.
You can:
The 7.x compilers recognize simple cases of tail recursion and compile a goto to the top of the function, avoiding stack use. More complicated recursion defeats most of the compiler's ability to modify the source: recursive procedures are not inlined; a loop containing a recursion is not unrolled; and a loop with a recursion is not executed in parallel. Hence there is good reason to remove recursion, after it has served its purpose as a design tool.
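As an illustration of removing a tail recursion by hand (a sketch, not taken from Bentley's text), here is a recursive binary search and its iterative equivalent:

int bsearch_rec(const int *x, int lo, int hi, int key)   /* tail recursive */
{
    int mid;
    if (lo > hi) return -1;
    mid = lo + (hi - lo) / 2;
    if (x[mid] == key) return mid;
    if (x[mid] <  key) return bsearch_rec(x, mid + 1, hi, key);
    return bsearch_rec(x, lo, mid - 1, key);
}

int bsearch_iter(const int *x, int lo, int hi, int key)  /* same logic, no stack growth */
{
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (x[mid] == key) return mid;
        if (x[mid] <  key) lo = mid + 1;
        else               hi = mid - 1;
    }
    return -1;
}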
Structure the program to exploit the parallelism available in the underlying hardware.
This rule is a no-brainer for the Origin 2000; parallel execution is what we do best! However, you can apply parallelism using a variety of programming models and at a wide range of scales from small to large. Two examples among many:
Parallel algorithm design is a very active area in computer science research. Execution of the following query:
+parallel +algorith* +(research | development | library)
on the Alta Vista search engine produced about 6000 hits, including two libraries of parallel algorithms at CMU (one of them devoted to irregular problems) and the ZRAM library at ETH in Switzerland.
As many variables as possible (including arrays and tables) should be initialized prior to program execution.
This rule encompasses the use of compile-time initialization of variables -- in C, by use of initializing expressions in declarations, in Fortran by use of the PARAMETER clause -- and also the preparation of large arrays and tables.
Compile-time initialization with manifest constants allows the compiler greater scope. The compiler precalculates subexpressions that have all constant arguments. When IPA is used, the compiler recognizes when a procedure argument is always a given constant, and propagates the constant into the procedure body. When functions are inlined, constants are propagated into the inlined code, producing more constant expressions.
Some programs spend significant time initializing large arrays, tables, and other data structures. There are two common cases: clearing to zero, and processing initial data from one or more files.
A for-loop that does nothing but store zero into an array is a waste of CPU cycles. Worse, in a NUMA system like the Origin 2000, all memory is by default allocated to the node from which it is first touched, which means that the zero array will be allocated in the node where the for-loop runs. So at the least, make sure that each parallel thread of the program initializes the array elements that it uses, and no others.
In a C program, it is much more efficient to use calloc(3) to allocate memory filled with zero. The page frames are defined but the actual pages are not created and not filled. Zero-filled pages are created as the program faults them in. A similar alternative is to use mmap(2) to map an extent of the pseudo-device /dev/zero. The mapped extent is added to the program's address space, but again, zero pages are created and allocated as they are faulted.
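A minimal sketch of the two zero-fill alternatives just described; error handling is abbreviated:

#include <stdlib.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

double *make_zero_array(size_t n)
{
    /* calloc: the space is reserved, but zero pages are faulted in only on use */
    return (double *) calloc(n, sizeof(double));
}

double *map_zero_array(size_t n)
{
    /* mmap of /dev/zero: the same lazy zero-fill, under explicit program control */
    int fd = open("/dev/zero", O_RDWR);
    void *p = mmap(NULL, n * sizeof(double), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
    close(fd);
    return (p == MAP_FAILED) ? NULL : (double *) p;
}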
In a Fortran program, it is somewhat more efficient to call the C library function bzero() than to code a DO loop storing zero throughout an array.
The time to read and process file data is essentially wasted, since it does not contribute to the solution. Bentley recommends writing a separate program that reads the input data and processes it into an intermediate form that can be read and stored more quickly. For example, if the initializing data is in ASCII form, a side program can preprocess it to binary. When the side program uses the same data structures and array dimensions as the main program, the main program basically reads the file directly into the array space.
IRIX permits you to extend this idea. You can use mmap(2) to map any file into memory. You can take all the initialization code and isolate it in a separate program. This program maps a large block of memory to a new disk file. It reads the input file data, processes it in any way required, and builds a complete set of data structures and arrays of binary data in the mapped memory; then unmaps it. This writes an image of the processed data into the file.
When the main program starts up, it maps the initialized file into memory, and its initialization is instantly complete! Pages of memory are faulted in from the disk file as the program refers to them, complete with the binary content as left by the initializing program.
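A minimal sketch of the mapping step in the main program; the file layout (appdata_t) and its contents are illustrative assumptions:

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

typedef struct { double table[1024][1024]; } appdata_t;   /* assumed layout, written
                                                              by the initializing program */

appdata_t *map_initialized_data(const char *path)
{
    int fd = open(path, O_RDONLY);
    void *p;
    if (fd < 0) return NULL;
    p = mmap(NULL, sizeof(appdata_t), PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping remains valid after close */
    return (p == MAP_FAILED) ? NULL : (appdata_t *) p;
}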
Replace the evaluation of a costly expression by one that is
algebraically equivalent but less costly.
Example: not
ln(A)+ln(B)
but ln(A*B) (give a passing thought to the chance that A*B
might overflow).
Bentley mentions that the frequent need to test whether an integer J is contained in the inclusive range (I,K), naively coded ((I<=J)&&(J<=K)), can be implemented as a single unsigned comparison, ((unsigned)(J-I) <= (unsigned)(K-I)). The value (K-I) can be precomputed. (When I and K are constants or propagated from constant expressions, the compiler precomputes it automatically.)
Algebraic identities are at the heart of strength reduction in loops. The compiler can recognize most opportunities for strength reduction in an expression that includes the loop induction variable as an operand. However, strength reduction is a general principle, and the compiler cannot recognize every possibility. Examine every expression in a loop to see if the same value could not be calculated more cheaply by an incremental change in the same value from the previous iteration of the loop.
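A minimal sketch of incremental evaluation: the product i*stride is carried forward by an addition rather than recomputed by a multiplication on every iteration. The compiler handles this simple case itself; the pattern matters when the recurrence is one the compiler cannot see.

void gather(double *dst, const double *src, int n, int stride)
{
    int i, k = 0;                 /* k tracks i*stride incrementally */
    for (i = 0; i < n; ++i) {
        dst[i] = src[k];
        k += stride;              /* addition replaces multiplication */
    }
}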
When the same expression is evaluated twice with no assignments to its variable operands, save the first evaluation and reuse it.
This rule, application of which is almost second nature to experienced programmers, is in fact applied well by the 7.x compilers. The compilers look for common subexpressions before and after constant propagation and inlining, and can often find subexpressions that are not superficially apparent in the source code. The expression value is held for its second use in a machine register if possible, or the compiler may create a temporary stack variable to hold it.
When similar expressions are evaluated together, consider making a new function that combines the evaluation of both.
Bentley cites Knuth as reporting that both sine and cosine of an angle can be computed as a pair for about 1.5 times the cost of computing one of them alone. The maximum and minimum values in a vector can be found in one pass for about 1.5 times the cost of finding either.
When it is not practical for the function to return both values, it can at least cache the paired answer (space-time rule 3) to be returned on request. In the case of sine/cosine, each of the functions can test to see if the argument angle is the same as the previous angle passed to either function. When it is not, call the function that calculates and caches both sine and cosine. Return the requested result.
In this regard, note that the 7.x compilers do recognize the dual nature of integer division and remainder, so that the sequence {div = a/b;rem = a%b;} compiles to a single integer divide.
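A minimal sketch of the paired sine/cosine cache described above; the function names sin_both() and cos_both() are hypothetical:

#include <math.h>

static double last_angle, last_sin, last_cos;
static int    pair_valid = 0;

static void compute_pair(double a)
{
    last_sin = sin(a);            /* both values computed together; a combined */
    last_cos = cos(a);            /* routine would cost about 1.5 times one call */
    last_angle = a;
    pair_valid = 1;
}

double sin_both(double a)
{
    if (!pair_valid || a != last_angle) compute_pair(a);
    return last_sin;
}

double cos_both(double a)
{
    if (!pair_valid || a != last_angle) compute_pair(a);
    return last_cos;
}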
Use the full word width of the underlying computer architecture to evaluate binary values in parallel.
This is a special case exploiting parallelism. The native integer in all current MIPS processors is 64 bits. The standard Fortran Boolean functions (IOR, IAND, etc.) handle INTEGER*8 values.
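A minimal sketch of word parallelism, assuming C99 fixed-width types: 64 Boolean flags packed into one word are combined or tested in a single 64-bit operation rather than 64 separate ones.

#include <stdint.h>

/* are any of the flags selected by 'mask' set in the packed flag word? */
int any_set(uint64_t flags, uint64_t mask)
{
    return (flags & mask) != 0;
}

/* intersection of two 64-element bit sets, one instruction wide */
uint64_t intersect(uint64_t a, uint64_t b)
{
    return a & b;
}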