Bentley's Rules from Writing Efficient Programs

Bentley's Rules

Jon Bentley, a distinguished computer scientist known for many publications, published Writing Efficient Programs. In this out-of-print classic, Bentley provides a unified, pragmatic treatment of program efficiency, independent of language and host platform.

For ease of presentation, Bentley codified his methods as a set of terse rules. The following is the list of Bentley's rules, edited from Appendix C of Writing Efficient Programs and adapted to suit the special needs and opportunities of the Origin 2000 architecture, SGI version 7.x compilers, and the MIPS instruction set.

Space-for-Time Rules
Time-for-Space Rules
1. Packing
2. Interpreters
Loop Rules
Logic Rules
Procedure Design Rules
Expression Rules

Space-for-Time Rules

1. Data Structure Augmentation

The time required for common operations on data can often be reduced by augmenting the structure with additional information, or by changing the information within the structure so it can be accessed more easily. Examples:

Reference counters facilitate garbage collection.
A "hint" containing a likely, but possibly incorrect, answer.

For the Origin 2000, with a 100:1 ratio between instruction time and memory-fetch time, it is more important to minimize secondary cache misses than to minimize instruction execution. Special applications of this rule include:

Reorganize data structures to make sure that the most-used fields fit in a single, 128-byte, cache line, and less-used fields fall into separate cache lines. (This of course presumes that each structure is 128-byte aligned in memory.)
In one example cited in a paper given at Supercomputing '96, most of the cache misses of one program were caused by "a search loop that accesses a four-byte integer field in each element of an array of structures...By storing the integer items in a separate, shadow array, the secondary cache miss rate of the entire application is reduced."
Reorganize structures so that a structure (or cache line) that can be updated is used by at most one CPU at a time. The alternative, when multiple CPUs share a cache line that can be updated, requires you to use mutual exclusion, but in addition, when one CPU updates a cache line, all the others must reload the data from memory.

2. Store Precomputed Results

The cost of recomputing an expensive function can be reduced by computing the function only once and storing the results. Subsequent requests for the function are handled by table lookup.

Precomputation is almost certainly a win when the replaced computation involves disk access, or an IRIX system call, or even a substantial libc function. Before precomputing an ordinary calculation, consider that if retrieval of the precomputed result can lead to an additional cache miss, the replaced computation must cost at least 100 instructions to break even.

3. Caching

Data that is accessed most often should be the cheapest to access. However, caching can backfire when locality is not in fact present. Examples:

The move-to-front heuristic for sequential list management is the classic example.
In implementing a dictionary, keep the most common words in memory.
Database systems cache the most-recently used tuples as a matter of course.

The MIPS R10000, Origin 2000 node, and IRIX kernel already cache at a number of levels:

Primary (on-chip) caches for instructions and data.
Secondary (on-board) cache for recently-used memory lines of 128 bytes.
Virtual memory system keeps most-recently used pages of 16KB and up, and also migrates pages to the nodes where they are most often referenced.
Filesystems (XFS and NFS independently) cache megabytes of recently-used records.

However, all these standard caches are shared by multiple processes (from the standpoint of your program, the other processes are "diluting" the cache with their data, driving out your data). Also, system cache management knows nothing about your algorithms. Application-dependent caching can yield huge payoffs, especially when it avoids disk access.

4. Lazy Evaluation

Never evaluate an item until it is needed. Examples:

In building a table of Fibonacci numbers (or any other function), only populate the table with numbers actually requested, as they are requested.
Brian Kernighan reduced the run time of [the original troff] by twenty percent by calculating the width of the current line when needed, rather than after each character.

Time-for-Space Rules

1. Packing

Dense storage representations can decrease storage costs by increasing the time to store and retrieve data. Typical examples include packing multiple binary integers into 32- or 64-bit words.

The memory complement of a typical Origin 2000 server is large enough that you might think such bit-squeezing techniques should be relics of the past. However, is file access time a significant factor in your program's speed? (Use prof(1) to find out.) If the file format can be squeezed 50%, sequential access time is cut by that much, and direct access time is reduced due to shorter seek distances. Also, if a key dataset does not fit into cache, and by packing a data structure you can get all of it into cache, the extra instruction cycles to unpack each item are repaid many times over.

2. Interpreters

The space required to represent a program can often be decreased by the use of interpreters in which common sequences of operations are represented compactly. A typical example is the use of a finite-state machine to encode a complex protocol or lexical format into a small table.

Finite-state machines, decision tables, and token interpreters can all yield elegant, maintainable algorithm designs. However, keep in mind that the coding techniques you use to implement an interpreter, in particular computed gotos and vector tables, tend to defeat the compiler's code manipulations. Don't expect the compiler to optimize or parallelize the interpreter's central loop.

Loop Rules

Many of these loop-transformation rules have been incorporated into the SGI 7.x compilers. This makes it easier to apply them in controlled ways using compiler options.

1. Code Motion Out of Loops

An invariant expression (one whose value does not depend on the loop variable) should be calculated once, outside the loop, rather than iteratively. But keep in mind:

Generally, the compiler is good at recognizing invariant expressions that are evident in the code. Place expressions where it is most natural to read or write them, and let the compiler move them for you. Example:
```
for (i=0;i<Z;++i) { if (x[i]<(fact/div))...; }
```
At nonzero optimization levels the compiler will recognize fact/div as invariant, and move it to a generated temporary in the loop prologue. You should leave the code in this naive but readable form, and not clutter it with yet another scratch variable.
Generally the compiler cannot recognize when a call to a user function is invariant.
```
for (i=0;i<Z;++i) { if (x[i]<func(fact,div))...; }
```
When func(fact,div) does not depend on i, you should move the call out of the loop:
```
for (i=0,t=func(fact,div);i<Z;++i) { if (x[i]<t)...; }
```
An exception is when you request function inlining. After the compiler inlines the function body, it may then recognize that some or all of the inlined code is invariant, and move those parts out of the loop.

2. Combining Tests

An efficient inner loop should contain as few tests as possible, and preferably only one. Try to combine exit conditions. Examples:

Add a sentinel value to the last item of an unsorted vector to avoid the test for last index.
As a special case, use a sentinel byte value to signal the end of a string of bytes, avoiding the test for last-byte-index.

This optimization can be especially helpful with the R10000. The R10000 can pipeline a single conditional branch, executing speculatively along the most likely arm. However, two conditional branches in quick succession can interfere with the pipeline and cause the CPU to skip cycles until the branch condition values are available.

3. Loop Unrolling

A large cost of some short loops is in modifying loop indexes. That cost can often be reduced by unrolling the loop.

This optimization can bring large benefits, especially in a superscalar CPU like the R10000. However, loop unrolling is a touchy, error-prone operation to carry out by hand, and the resulting code is hard to read and to maintain. Fortunately, the 7.x compilers have extensive support for loop unrolling. You can control the amount of unrolling either with a compiler option or loop-by-loop with directives. The compiler takes care of the niggling details and the bookkeeping involved in handling end-of-loop, while the source code remains readable.

A benefit of compiler-controlled loop unrolling is that the compiler applies the optimizations of common subexpression elimination, constant propagation, code motion out of loops, and function call inlining both before and after unrolling the loop. The optimizations work together for a synergistic effect. The software-pipelining optimization (rearranging the order of the generated machine instructions in order to maximize use of the R10000 execution units) also becomes more effective when it has a longer loop body to work on.

4. Transfer-Driven Loop Unrolling

When a large cost of an inner loop is devoted to trivial assignments (other than maintenance of loop indices), the assignments can be reduced by duplicating the loop body and renaming the variables in the second copy.

As with simple loop unrolling, this important optimization is well-handled by the compiler. Leave your code simple and readable and let the compiler do this.

5. Unconditional Branch Removal

A fast loop should have no unconditional branches. An unconditional branch at the end of a loop can be handled by "rotating" the loop to put a conditional branch at the bottom.

This optimization might be needed in assembly-language code, but no modern compiler should generate code for a loop with the test at the top and an unconditional branch at the bottom.

6. Loop Fusion

If two nearby loops operate on the same set of elements, combine their operational parts and use only one set of loop-control operations. Example: convert

for (i=0;i<n;i++)
   for (j=0;j<n;j++)
      b[i][j] = a[i][j];
for (i=0;i<n;i++)
   for (j=0;j<n;j++)
      c[i][j] = b[i][j];

into the single loop

for (i=0;i<n;i++)
   for (j=0;j<n;j++)
      b[i][j] = a[i][j];
      c[i][j] = b[i][j];

This technique can produce important benefits in the Origin 2000 because, first, the contents of array b are only passed through the cache once, and second, the instructions to load each element of b (from the second loop) are eliminated. However, it is not necessary for you to perform this optimization manually. The 7.x compilers perform loop fusion automatically at certain optimization levels. The compilers can also perform the following optimizations:

Loop interchange -- exchanging the induction variables of inner and outer nested loops so as to pass over memory consecutively.
Cache blocking -- breaking a single loop nest into a sequence of smaller nests that operate on blocks of data that each fit in secondary cache.
Loop fission -- when two inner loops are controlled by a single outer loop, duplicating the outer loop and creating two simple loop nests that can then be interchanged, blocked, or unrolled.

Each of these changes, if done manually, complicates the source code with many temporary variables and insanely complicated loop control. In general, write loop code in the simplest, most portable and maintainable fashion, and then let the compiler tangle it behind the scenes.

Logic Rules

1. Exploit Algebraic Identities

In a conditional expression, replace a costly expression with an algebraically equivalent expression that is cheaper to evaluate. Examples:

Not (sqr(X)>0) but (X!=0). If this also allows you to avoid calculating sqr(X) unless it is needed, it is an instance of Lazy Evaluation.
Not (not(A) .and. not(B)) but not(A .or. B) (DeMorgan's Law).

Any such replacements that require insight are worth doing. Do not spend time replacing simple manifest expressions with constants; the compiler is good at that, and good at recognizing common subexpressions.

2. Short-Circuit Monotone Functions

When a monotone function of several variables is to be tested for a threshold value, break off the test as soon as the result is known.

Short-circuit evaluation of a Boolean expression is a well-known case (handled automatically by the compiler). Bentley gives a more interesting example. Given a function to find the point in a list that is nearest to a given point:

typedef struct point_s { double X, Y; } point_t;
#include <math.h>
int nearest_XY(int n, point_t given, point_t points[])
{
   int j, close_n=-1;
   double thisDist, closeDist=MAXFLOAT;
   for (j=0; j<n; ++j)
   {
      thisDist = sqr(points[j].X-given.X) + sqr(points[j].Y-given.Y);
      if (thisDist < closeDist)
      {
         closeDist = thisDist;
         close_n = j;
      }
   }
   return close_n;
}

This function as it stands short-circuits the distance computation (for purposes of comparing two distances, it is sufficient to compare the sum of squares and skip doing the square root). However, in many cases the X-distance alone is enough to eliminate a point, which avoids the need for the second multiply and an addition.

int nearest_XZ(int n, point_t given, point_t points[])
{
   int j, close_n=-1;
   double thisDist, closeDist=MAXFLOAT;
   for (j=0; j<n; ++j)
   {
      thisDist = sqr(points[j].X-given.X);
      if (thisDist < closeDist) /* possible */
      {
         thisDist += sqr(points[j].Y-given.Y);
         if (thisDist < closeDist)
         {
            closeDist = thisDist;
            close_n = j;
         }
      }
   }
   return close_n;
}

3. Reorder Tests

Logical tests should be arranged such that cheap tests precede expensive ones, and so that ones most likely to succeed precede ones more likely to fail.

In the absence of other information, the SGI compilers assume that the then part of an if is more likely than the else, so it is good practice to put the most likely code in the then part.

It is possible to run the program under speedshop (experiment type ideal), producing a file of exact execution counts. This file can be filtered through prof (format -feedback) and the resulting file can be fed back into the next compilation. The compiler will take branch execution counts from the feedback file, and create an instruction schedule that reflects the exact probabilities of each branch.

This is a convenient feature that can have good results on a poorly-written program. However, the compiler does not take into account the cost of a test that involves a function call, or a test that can touch an unused cache line or page. Your insight is still needed to order tests so as to avoid expensive operations when possible.

4. Precompute Logical Functions

A logical function over a small, finite domain can be replaced by a table lookup. Examples include using a table to classify characters in lexical classes, and implementing an interpreter using a branch table indexed by instruction code.

When the table is small and frequently-used, this technique is almost always helpful. However, when the table is sparse, or much larger than the executable code that it replaces, the effect of the table on the cache could swamp the benefits.

5. Control Variable Elimination

Assignment to a Boolean variable and its later test can be replaced by an if statement on the Boolean value. Generalizing, assignment to any control variable over a small, finite domain can be replaced by a case statement on the value that would have been assigned.

Application of this rule saves three things: declaration of a variable to hold a control value; assignment into the variable; and fetching from the variable when the value is tested. The rule says that instead, you should modify the code so the control value is used to direct execution as soon as the value is calculated. In some cases you have to duplicate code in order to apply the rule.

The 7.x compilers are good at reusing the stack space of local variables after their last use, and also good at saving calculated values in machine registers between their assignment and a nearby use. When the sole use of a local variable is to hold a value between its expression and a following test, the compiler (at -O2 and higher) is quite capable of eliminating the variable and using only a register temp.

Hence the only time you should distort your program's logic to apply this rule is when the assignment is separated from its use by many statements, or by a non-inlined procedure call, (A procedure call usually makes the compiler spill register values to memory.)

Procedure Design Rules

1. Collapse Procedure Hierarchies

The run time of a set of procedures that (nonrecursively) call themselves can often be reduced by rewriting the procedures inline and binding what were passed arguments.

This technique is extremely useful, but difficult to apply to an existing program and still leave the program readable and maintainable. Fortunately, the 7.x compilers contain extensive support for interprocedural analysis (IPA) and automatic inlining. Given that, it is best to write the program in a plain, naive way, not fretting about frequent calls to small procedures, and to let the compiler inline the small procedures.

Although inlining eliminates the instruction overhead of a function call, it inflates the size of the object code. At a certain level of inlining, the effect of the inflated, low-density code on cache and virtual memory wastes as much time as it saves.

Keep in mind the maintenance implications of inlining procedures in code you will distribute to other users. A library procedure that has been inlined even once can never be field-replaced. You can send out a revised version of the library, but programs that inlined the function will never call the library version.

2. Exploit Common Cases

A procedure should be organized to handle the commonest cases most efficiently, and all remaining cases correctly (if more slowly). When designing the application, attempt to arrange the input conditions so that the efficient cases are the commonest.

The Caching rule can be seen as an extension of this rule, in which "the commonest case" is "whatever case was most recently requested." Bentley cites a Julian-date conversion function in which it was observed that 90% of all calls presented the same date as the previous call. Saving the last result, and returning it without recomputing when appropriate, saved time.

One way to apply the rule is to have a function test for the common cases and return their results quickly, falling through to the more involved code of the hard cases. However, such a function is likely to be a bulky one to inline. Instead, write two functions. The public function is short, and efficiently handles the most common cases. When it is called with a case it cannot handle, it calls the longer, private function that deals with hard cases.

The point of this design is that the public function is short enough that you can let the compiler inline it automatically. The common cases will be handled by the inlined code. The compiler should not inline the less-used, longer function.

3. Use Coroutines

A multiphase algorithm in which the phases are linked by temporary files (or arrays) can be reduced to a single-pass algorithm using coroutines.

Coroutines are defined and described in detail in Knuth (Volume I) and most other modern books on algorithmics. Under IRIX, you can write coroutines in a natural way using any one of three models:

The UNIX model of forked processes that communicate using pipes.
The POSIX threads model using POSIX message queues to communicate.
The MPI (or PVM) model of cooperating processes exchanging messages.

4. Transform Recursive Procedures

A simple, readable solution expressed as a recursion can be transformed into a more efficient iterative solution. You can:

Code an explicit recursion using a stack variable.
When the final action of a procedure is to call itself ("tail recursion"), replace the call with a goto to the top of the procedure. (Then rewrite the iteration as a loop.)
Cut off the recursion early for small arguments (for example using a table lookup), instead of recurring down to the case of size 0.

The 7.x compilers recognize simple cases of tail recursion and compile a goto to the top of the function, avoiding stack use. More complicated recursion defeats most of the compiler's ability to modify the source: recursive procedures are not inlined; a loop containing a recursion is not unrolled; and a loop with a recursion is not executed in parallel. Hence there is good reason to remove recursion, after it has served its purpose as a design tool.

5. Use Parallelism

Structure the program to exploit the parallelism available in the underlying hardware.

This rule is a no-brainer for the Origin 2000; parallel execution is what we do best! However, you can apply parallelism using a variety of programming models and at a wide range of scales from small to large. Two examples among many:

Using the POSIX-compliant asynchronous I/O library, you can schedule one or an entire list of file accesses to be performed in a parallel thread, with all the file-synchronous waits being taken by a different process.
The native integer in the R10000 has 64 bits (under either -n32 or -64 compiler option). You can pack many smaller units into long long (or INT*8) variables and operate on them in groups.

Parallel algorithm design is a very active area in computer science research. Execution of the following query:

+parallel +algorith* +(research | development | library)

on the Alta Vista search engine produced about 6000 hits, including two libraries of parallel algorithms at CMU, one for irregular problems and another one, and the ZRAM library at ETH in Switzerland.

Expression Rules

1. Initialize Data Before Runtime

As many variables (including arrays and tables) should be initialized prior to program execution.

This rule encompasses the use of compile-time initialization of variables -- in C, by use of initializing expressions in declarations, in Fortran by use of the PARAMETER clause -- and also the preparation of large arrays and tables.

Compile-time initialization with manifest constants allows the compiler greater scope. The compiler precalculates subexpressions that have all constant arguments. When IPA is used, the compiler recognizes when a procedure argument is always a given constant, and propagates the constant into the procedure body. When functions are inlined, constants are propagated into the inlined code, producing more constant expressions.

Some programs spend significant time initializing large arrays, tables, and other data structures. There are two common cases: clearing to zero, and processing initial data from one or more files.

Initializing to Zero

A for-loop that does nothing but store zero into an array is a waste of CPU cycles. Worse, in NUMA system like the Origin 2000, all memory is by default allocated to the node from which it is first touched, which means that the zero array will be allocated in the node where the for-loop runs. So at the least, make sure that each parallel thread of the program initializes the array elements that it uses, and no others.

In a C program, it is much more efficient to use calloc(3) to allocate memory filled with zero. The page frames are defined but the actual pages are not created and not filled. Zero-filled pages are created as the program faults them in. A similar alternative is to use mmap(2) to map an extent of the pseudo-device /dev/zero. The mapped extent is added to the program's address space, but again, zero pages are created and allocated as they are faulted.

In a Fortran program, it is somewhat more efficient to call the C library function bzero() than to code a DO loop storing zero throughout an array.

Initializing From Files

The time to read and process file data is essentially wasted, since it does not contribute to the solution. Bentley recommends writing a separate program that reads the input data and processes it into an intermediate form that can be read and stored more quickly. For example, if the initializing data is in ASCII form, a side program can preprocess it to binary. When the side program uses the same data structures and array dimensions as the main program, the main program basically reads the file directly into the array space.

IRIX permits you to extend this idea. You can use mmap(2) to map any file into memory. You can take all the initialization code and isolate it in a separate program. This program maps a large block of memory to a new disk file. It reads the input file data, processes it in any way required, and builds a complete set of data structures and arrays of binary data in the mapped memory; then unmaps it. This writes an image of the processed data into the file.

When the main program starts up, it maps the initialized file into memory, and its initialization is instantly complete! Pages of memory are faulted in from the disk file as the program refers to them, complete with the binary content as left by the initializing program.

2. Exploit Algebraic Identities

Replace the evaluation of a costly expression by one that is algebraically equivalent but less costly. Example: not ln(A)+ln(B) but ln(A*B) (give a passing thought to the chance that A*B might overflow).

Bentley mentions that the frequent need to test whether an integer J is contained in the inclusive range (I,K), naively coded ((I<=J)&&(J<=K)), can be implemented as ((J-I)<(K-I)). The value (K-I) can be precomputed. (When I and K are constants or propagated from constant expressions, the compiler precomputes it automatically.)

Algebraic identities are at the heart of strength reduction in loops. The compiler can recognize most opportunities for strength reduction in an expression that includes the loop induction variable as an operand. However, strength reduction is a general principle, and the compiler cannot recognize every possibility. Examine every expression in a loop to see if the same value could not be calculated more cheaply by an incremental change in the same value from the previous iteration of the loop.

3. Eliminate Common Subexpressions

When the same expression is evaluated twice with no assignments to its variable operands, save the first evaluation and reuse it.

This rule, application of which is almost second nature to experienced programmers, is in fact applied well by the 7.x compilers. The compilers look for common subexpressions before and after constant propagation and inlining, and can often find subexpressions that are not superficially apparent in the source code. The expression value is held for its second use in a machine register if possible, or the compiler may create a temporary stack variable to hold it.

4. Combine Paired Computation

When similar expressions are evaluated together, consider making a new function that combines the evaluation of both.

Bentley cites Knuth as reporting that both sine and cosine of an angle can be computed as a pair for about 1.5 times the cost of computing one of them alone. The maximum and minimum values in a vector can be found in one pass for about 1.5 times the cost of finding either.

When it is not practical for the function to return both values, it can at least cache the paired answer (space-time rule 3) to be returned on request. In the case of sine/cosine, each of the functions can test to see if the argument angle is the same as the previous angle passed to either function. When it is not, call the function that calculates and caches both sine and cosine. Return the requested result.

In this regard, note that the 7.x compilers do recognize the dual nature of integer division and remainder, so that the sequence {div = a/b;rem = a%b;} compiles to a single integer divide.

5. Exploit Word Parallelism

Use the full word width of the underlying computer architecture to evaluate binary values in parallel.

This is a special case exploiting parallelism. The native integer in all current MIPS processors is 64 bits. The standard Fortran Boolean functions (IOR, IAND, etc.) handle INTEGER*8 values.

Saleh Elmohamed, saleh@npac.syr.edu