Overview of ARCHTEST

Copyright (C) 1994-96 Multiprocessor Diagnostics

Introduction
Uses for ARCHTEST
Rules Tested for by ARCHTEST
Tests Performed by ARCHTEST
Other Capabilities
Results of Testing
Example. Test T4
Research Problems
A True Experience
Multiprocessor Diagnostics
References
Appendix

Introduction

ARCHTEST is a program which runs on a shared memory multiprocessor to determine the memory model followed by the machine. A memory model consists of a set of behaviors; the behaviors can be either strong or relaxed. If all behaviors are strong, the machine is said to be sequentially consistent [LAM79]. Among the relaxed behaviors detected by ARCHTEST are (1) executing instructions out of order, and (2) making stale data visible to programs.

ARCHTEST provides a concrete and objective measure of a machine's logical behavior in accessing shared data. ARCHTEST is unique. No other program now provides the information provided by ARCHTEST.

Uses for ARCHTEST

Manufacturers of shared memory multiprocessors need ARCHTEST:

1. To identify areas in which the behavior of a machine can be made stronger or relaxed even further.
2. To ensure that machines that are supposed to be compatible are in fact compatible in operations on shared data.

At the heart of an operating system there are multiple threads operating on shared data. In this area programmers cannot use semaphores; to avoid slow, atomic instructions, they will use load and store instructions whenever possible. The writers of such code need to know the architecture, as identified by ARCHTEST, of the machines on which the software will run. Failure to understand subtle differences in machines can lead to rare, very hard to find bugs.

As multiprocessing moves to the desktop more multi-threaded code will be written. To avoid time-consuming calls to system functions, applications programmers will seek to do their own locking of data. Here, also, it will be vital to understand the architectures of the machines on which the code will run.

Rules tested for by ARCHTEST

The computation rules (CMP) describe how the terminal value of each operand is calculated from the initial values of the operands.

The rule of uniprocessor (UPO) order describes the ordering constraints that a uniprocessor must obey. Each processor on a multiprocessor must obey the UPO rule.

The order rules describe the constraints that a processor on a multiprocessor system may be required to obey:

Program order (PO) requires that all operations on shared data occur in the order defined by the program.
Read order (RR) requires that all read operations on shared data occur in the order defined by the program.
Write order (WW) requires that all write operations on shared data occur in the order defined by the program.
Read-Write order (RW) requires that if a read instruction occurs before a write instruction in the logical order defined by a program (and if both instructions access shared data), then the read instruction will be executed before the write instruction.
Write-Read order (WR) requires that if a write instruction occurs before a read instruction in the logical order defined by a program (and if both instructions access shared data), then the write instruction will be executed before the read instruction.

The rules of cache coherence describe the constraints that a multiprocessor system may be required to obey in write operations:

CC1: All threads see each operand change value at exactly the same instant. CC1 is also called write atomicity.
CC2: All threads see exactly the same sequence of changes of values for all operands.
CC3: All threads see exactly the same sequence of changes of values for each separate operand.
CC4: Different threads may see an operand assume different sequences of values.

No program can distinguish between CC1 and CC2 machines [COL92]. However, programs can distinguish between all of the other types of machines. A failure of a machine to obey CC1 (and CC2) demonstrates that the machine makes stale data visible to programs.

Tests Performed by ARCHTEST

ARCHTEST consists of twelve distinct tests, T1 through T12. Denote by A(X,Y,Z) the architecture consisting of the rules X, Y, and Z. Each test in ARCHTEST seeks to detect that a machine relaxes a particular architecture. The architecture for which each test seeks to detect a relaxation is:

T1** A(CMP, RR, WW)
T2** A(CMP, RR, WW)
T3** A(CMP, RR, WW)
T4** A(CMP, RR, WW, WR)
T4** A(CMP, RR, WW, RW)
T5** A(CMP, RR, WW, CC1)
T6** A(CMP, RR, WW, CC1)
T7** A(CMP, UPO, RR, WW, CC1)
T8** A(CMP, UPO, WW, CC3)
T9** A(CMP, UPO, WW, CC3)
T11** A(CMP, RR, WW, WR)
T11** A(CMP, RR, WW, RW)
T12** A(CMP, UPO, RR, WW, CC1)

(Test T10 is an experimental test which seeks to measure the performance impact of cast outs.)

If '**' = '00', then the test operates only on integer operands. If '**' = '10', then the test operates on both integer operands and floating point operands. If '**' = '20', then the test operates only on floating point operands. Tests T1, T8, and T9 operate only on a single operand and so for them there is no '10' case.

Tests T1-T7 were first described in Appendix D of [COL92].

Other Capabilities

For each interaction on shared data ARCHTEST calculates a numerical value representing how close the machine came to relaxing a rule. These values for each test are accumulated and presented in a histogram. The histograms describe the behavior of a machine. Some machines never relax a given rule; some do so often and to an extreme degree; others do so very infrequently (thereby suggesting that an unintended relaxation has become visible). The most extreme values seen so far show an instruction occurring around a couple of hundred instructions out of order, or a processor seeing a value of an operand which has been stale for about the same length of time.

Data can be generated either via machine simulation or via assembler language routines and then fed into ARCHTEST to obtain the full capabilites of ARCHTEST in analysing the data. This allows for data to be created in more idiosyncratic circumstances than is possible with a C language program (such as ARCHTEST). This feature also allows one to ensure that the behaviors observed by ARCHTEST truly originate in the hardware and are not an artifact of the C compiler.

The basic tests in ARCHTEST operate on only one, two, or four operands. It is conceivable that a machine could operate correctly on such a light load, but fail on a heavier load. Tests logically involving more operands can take an extremely long time to analyze. A simple compromise is to run a basic test accessing the essential operands, while constantly accessing some number of irrelevant shared operands. This increases the cache traffic, while allowing the analysis to remain brief. Two forms of irrelevant traffic can be generated: (1) irrelevant accesses can be made to lines other than the ones containing the essential operands, and/or irrelevant accesses can be to the same lines as the lines containing the essential operands.

Some of the data produced by ARCHTEST yields performance information. For example, ARCHTEST prints out the time to perform 1,000,000 load instructions, add instructions, subtract instructions, multiply instructions, and divide instructions (both integer and floating operations). ARCHTEST also records the time to perform each test and the time to analyze each test. In one case machine B was known to be twice as fast overall as machine A, yet the tests showed machine A to be faster (15% in some tests; 80% in others) on shared data operations than machine B.

Some other behaviors made visible by ARCHTEST are neither tests of the logic nor measurements of performance. The output from Test T6 can be plotted to show the interaction of six processors (two writing, four reading) on almost an instruction- by-instruction level. The output from Test T8 shows that on some machine instructions, the same sequences of values (called convoys) appear to each processor in turn. No one has yet been willing to judge such behavior as either good or bad.

If there is a need to analyze a particular behavior in greater detail than that performed by ARCHTEST, it is possible to dump the arrays created by ARCHTEST for later study.

ARCHTEST performs each logical test (1) on integer operands only, (2) on floating operands only, and (3) where possible, on both integer and floating operands.

Many parameters can be set and modified at run time.

Results of Testing

Twenty-four machines have had some or all of the tests run on them. For many machines no formal statement of their architecture has been given. Of those machines for which an architecture was known, only the Hitachi machine was more relaxed than was allowed by its architecture.

The amount of information associated with the following results varies widely. In some cases I personally made the test runs and so am confident of their validity. In many cases I relied on others to make the runs; their responses varied from an email message ("Passed all tests.") to a copy of the program output, but often with limited identification of the hardware system. These results should be regarded as tentative, until full, formal testing can be done.

The machines tested have fallen into one of three more or less clearly defined groups. Some machines were sequentially consistent (SC). This corresponds to an architecture of A(CMP,UPO,PO,CC1). Others obeyed A(CMP,UPO,RR,WW,RW,CC1), that is, they relaxed sequential consistency only in allowing reads to occur before logically preceding writes. Call this the POK architecture, since the IBM mainframes designed in Poughkeepsie were the first to follow this standard. The third group of machines are called processor consistent (PCon) [GOO89]. They relax both the WR and the CC1 rules. The architecture is A(CMP,UPO,RR,WW,RW,CC3). (More details on the testing can be found in the Appendix.)

The SC machines were:

1. The Ultra machine at New York University.
2. A Solborne 3-way Sparc at the Compaq Computer Corp.
3. An SGI Onyx at the University of Minnesota.
4. A 4-way Sun Sparc 630 at SUNY New Paltz.
5. A KSR-1 machine at the University of Toronto.

The POK machines were:

6. Three IBM mainframes.
7. An Amdahl mainframe.
8. An Intergraph 2-way TD4.
9. An ALR Revolution 2-way.
10. An AST Manhattan V Series 5090.
11. A Hewlett-Packard Vectra XU 5/90.
12. XBit Computer, ASUS motherboard.

The PCon machines were:

13. An Hitachi mainframe.
14. An SGI Power Challenge.
15. An ALR Revolution 4/100.
16. A Compaq Proliant 3-way.
17. A Compaq 4000 5/66.
18. An NCR 2-way.
19. An Olivetti 3-way.
20. A Sequent 6-way.
21. A Tricord server.
22. A 2-way Sun Sparc 20 at SUNY New Paltz.

In addition, a half dozen machines currently in development have been tested. No publishable results are yet available for them.

A recent study [A&G96] suggests that even more relaxed behavior will be visible on the Alpha, the PowerPC, the T3D, and some other machines.

Example. Test T4: Seek a relaxation of A(CMP,PO).

Sequential consistency requires that all read and write operations to shared data occur in the order defined by the program. The most common relaxation of PO is to allow read operations to occur in time before logically preceding write operations occur. Test T4 seeks to detect a relaxation of WR and of RW.

Suppose a 2-way multiprocessor executes the following code and calculates the given terminal values. The way to show that a read occurred before a logically preceding write (or even worse, that the machine relaxed the rule of computation) is to assume that the machine obeyed the rules of computation and of program order and then to deduce a contradiction.

     Initially, (A,B,U,V) = (0,0,0,0).

       P1         P2
     A = 1;     B = 1;
     U = B;     V = A;

     Terminally, (A,B,U,V) = (1,1,0,0).

If the machine obeys program order, then in P1 the write into A occurs before the read from B.

If the machine obeys the rule of computation, then since U is zero, the read from B in P1 occurs before the write of a one value into B in P2.

If the machine obeys program order, then in P2 the write into B occurs before the read from A.

If the machine obeys the rule of computation, then since V is zero, the read from A in P2 occurs before the write of a one value into A in P1.

Therefore, a circuit in time exists among the read and write operations for the program, which is a contradiction. Thus the machine fails to obey the rule of computation and/or the rule of program order.

For such a brief program to be successful as a test case, the two processors have to be exactly synchronized. Extending the test case in the manner shown below creates a test program that can be used on a real machine.

         P1             P2
      A    = 0;      B    = 0;
      U[0] = B;      V[0] = A;
      A    = 1;      B    = 1;
      U[1] = B;      V[1] = A;
      A    = 2;      B    = 2;
      U[2] = B;      V[2] = A;
      A    = 3;      B    = 3;
      U[3] = B;      V[3] = A;
        . . .          . . .
      A    = k;      B    = k;
      U[k] = B;      V[k] = A;

If k = 200,000, then there is a much longer window during which a relaxation of the rule of program order can be detected. Note that the processors do not have to be in sync; either can be interrupted away for a long period. So long as both execute at the same time, the possibility exists for them to reveal a relaxation of the rule of program order.

Research Questions

1. Tests T5, T6, and T7 all test for a relaxation of write atomicity. Test T7 shows that numerous machines relax WA. No machine to date has failed T5 or T6. Why do the machines that fail test T7, never fail the logically equivalent tests T5 or T6?
2. All machines found to obey program order, also obey write atomicity. Why is this? Conceptually, the two rules are quite distinct and independent.
3. Why do convoys occur? Are they good or bad?
4. Is compatibility needed. There are three major groups of behaviors, plus a couple of minor variations, plus more promised. Such variations can lead to bugs (see next section). Is there a need for industry standardization?
5. Relaxing architectures can lead to an improvement in machine performance. The effect, however, is reduced by an increased need to use atomic instructions in frequently executed routines at the heart of an operating system. What is the net benefit of relaxing an architecture?

A True Experience with Minisculy Distinct Architectures

An operating systems programmer for company A discovered a bug when the system ran on a machine from company B. The bug never occurred when the system ran on the supposedly compatible machine produced by company A. The programmer concluded that the B machine had a bug which was not present on the A machine.

The truth was difficult to get to. Eventually, both machines were found to obey the architecture defined for the machines. However, the A machine, as implemented, was stronger than the architecture required. The B machine implemented the architecture more accurately. The OS had not been coded to perform correctly on a machine obeying a strict implementation of the architecture, and so the fault was found to be in the OS, not in the B machine.

Multiprocessor Diagnostics

Multiprocessor Diagnostics was formed by William W. Collier in 1993 to develop and to market ARCHTEST. The development of ARCHTEST was supported by the Mid-Hudson InfoMall section of the Northeast Parallel Architecture Center at Syracuse University (Dennis Eaton, Director). It is a pleasure to acknowledge this support.

ARCHTEST is available to academics under a research license which allows long-term, noncommercial use of ARCHTEST at no fee. Researchers are expected to undertake extensive use of ARCHTEST and to make the results of such use available to others, either in journals or on the web.

Commercial use of ARCHTEST requires either a site license or an entity license. A site license is $5,000 (a one-time charge); it allows use of ARCHTEST by one company in any one county in the U.S.A. (or the equivalent overseas). An entity license is $25,000 (a one-time charge). It allows use of ARCHTEST by a company anywhere in the world. Purchase orders can be faxed to 914-896-7675.

ARCHTEST is about 10,000 lines of C code. A single flag allows ARCHTEST to compile to run on Windows NT, Solaris, or UnixWare 2.0. For other systems users may have to supply code to start and to stop multithreading.

For further information see http://www.infomall.org or contact William W. Collier at Multiprocessor Diagnostics, 13 Gary Place, Wappingers Falls, New York 12590. Tel: 914-297-5901. E-mail: collier@acm.org.

References

A&G96. S. Adve and K. Gharachorloo. http://www-ece.rice.edu /~sarita/Publications/model_tutorial.ps. 1996.
COL92. W. W. Collier. Reasoning About Parallel Architectures. Prentice-Hall, Englewood Cliffs, NJ. 1992.
GOO89. J. Goodman. Cache Consistency and Sequential Consistency. Technical Report No. 61, SCI Committee, March 1989.
LAM79. L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers C28:9 (1979).

Appendix

The results of testing 24 machines are summarized below. Some number of testers who provided data were not willing to be identified.

Key: N=n n is the number of processors on the system. T=t t is the number of threads in the test run. K=k k is the length of the arrays used in gathering data. "." => Test was not run either because too few threads were defined or the test had not been implemented at the time of the test. Tests T1, T2, T4, T7, and T8 run with two threads; the others require more. "O" => Test was run; strong behavior was observed. "X" => Test was run; relaxed behavior was observed.

     1.  The Ultra machine at New York University.
  N=4 T=4 K=20000    1991  M. Ayala
  1 O. 2 O.. 3 O.. 4 O.. 5 O.. 6 O.. 7 O.. 8 .. 9 .. 11 ... 12 ...

     2.  A Solborne 3-way Sparc at the Compaq Corp.
  N=3 T=3 K=?        1993  D. Koenen
  1 .. 2 O.. 3 ... 4 O.. 5 O.. 6 ... 7 O.. 8 .. 9 .. 11 ... 12 ...

     3.  An SGI Onyx at the University of Minnesota.
  N=4 T=4 K=20000    1994  F. Mounes-Toussi, D. Lilja, Univ. of Minnesota
  1 .. 2 OO. 3 OO. 4 OO. 5 OO. 6 OO. 7 OO. 8 O. 9 .. 11 ... 12 ...

     4.  A 4-way Sun Sparc 630 at SUNY New Paltz.
  N=4 T=4 K=200000   1996  W. Collier
  1 OO 2 OOO 3 OOO 4 OOO 5 OOO 6 OOO 7 OOO 8 OO 9 OO 11 OOO 12 OOO

     5.  A KSR-1 machine at the University of Toronto.
  N=8 T=8 K=500000   1996  N. Manjikian, University of Toronto
  1 OO 2 OOO 3 OOO 4 OOO 5 OOO 6 OOO 7 OOO 8 OO 9 .. 11 ... 12 ...
     The program ran out of time before completing the tests.

     6.  Three IBM mainframes.
  N=4 T=4 K=200000   1992  W. Collier
  1 O. 2 OO. 3 OO. 4 XX. 5 OO. 6 OO. 7 OO. 8 .. 9 .. 11 ... 12 ...

     7.  An Amdahl mainframe.
  N=4 T=4 K=200000   1992  W. Collier
  1 O. 2 OO. 3 OO. 4 XX. 5 OO. 6 OO. 7 OO. 8 .. 9 .. 11 ... 12 ...

     8.  An Intergraph 2-way TD4
  N=2 T=2 K=20000    1995  Anon
  1 .. 2 OO. 3 ... 4 XO. 5 ... 6 ... 7 OO. 8 O. 9 .. 11 ... 12 ...

     9.  An ALR Revolution 2-way.
  N=2 T=2 K=500000   1995  Anon
  1 .. 2 OO. 3 ... 4 XX. 5 ... 6 ... 7 OO. 8 O. 9 .. 11 ... 12 ...

     10.  An AST Manhattan V Series 5090.
  N=2 T=2 K=20000    1995  Anon
  1 .. 2 OO. 3 ... 4 XO. 5 ... 6 ... 7 OO. 8 O. 9 .. 11 ... 12 ...

     11.  A Hewlett-Packard Vectra XU 5/90.
  N=2 T=2 K=20000    1995  Anon
  1 .. 2 OO. 3 ... 4 OO. 5 ... 6 ... 7 OO. 8 O. 9 .. 11 ... 12 ...
  N=2 T=2 K=250000   1995  Anon
  1 .. 2 OO. 3 ... 4 XX. 5 ... 6 ... 7 OO. 8 O. 9 .. 11 ... 12 ...
     The histograms of the test results for this machine show behavior
which is only rarely and just barely over the edge in relaxed
territory.

     12.  XBit Computer, ASUS motherboard
  N=2 T=2 K=200000   1995  Anon
  1 OO 2 ... 3 ... 4 XX. 5 ... 6 ... 7 OO. 8 O. 9 .. 11 ... 12 ...


     13.  An Hitachi mainframe.
  N=4 T=4 K=200000   1992  W. Collier
  1 O. 2 OO. 3 OO. 4 XX. 5 OO. 6 OO. 7 XX. 8 .. 9 .. 11 ... 12 ...
     The architecture for this machine is POK; it should not have
failed Test T7.

     14.  An SGI Power Challenge.
  N=2 T=2 K=500000   1995  F. Mounes-Toussi, D. Lilja, Univ. of Minnesota
  1 .. 2 OO. 3 ... 4 OX. 5 ... 6 ... 7 OX. 8 O. 9 .. 11 ... 12 ...
     This machine is SC except when operating on both integer and
floating operands.

     15.  An ALR Revolution 4/100.
  N=4 T=4 K=400000   1995  Anon
  1 .. 2 OO. 3 OO. 4 XX. 5 OO. 6 OO. 7 XX. 8 O. 9 .. 11 ... 12 ...

     16.  A Compaq Proliant 3-way
  N=3 T=3 K=500000   1995  Anon
  1 .. 2 OO. 3 OO. 4 XX. 5 OO. 6 OO. 7 XX. 8 O. 9 .. 11 ... 12 ...

     17.  A Compaq 4000 5/66.
  N=R T=R K=20000    1995  Anon
  1 .. 2 OO. 3 OO. 4 XX. 5 OO. 6 OO. 7 XX. 8 O. 9 .. 11 ... 12 ...

     18.  An NCR 2-way.
  N=2 T=2 K=500000   1995  Anon
  1 .. 2 OO. 3 ... 4 XX. 5 ... 6 ... 7 XX. 8 O. 9 .. 11 ... 12 ...

     19.  An Olivetti 3-way.
  N=3 T=3 K=500000   1995  Anon
  1 .. 2 OO. 3 ... 4 XX. 5 OO. 6 ... 7 XX. 8 O. 9 .. 11 ... 12 ...

     20.  A Sequent 6-way
  N=6 T=6 K=250000   1995  Anon
  1 .. 2 OO. 3 OO2. 4 XX. 5 OO. 6 OO. 7 XX. 8 O. 9 .. 11 ... 12 ...

     21.  A Tricord server.
  N=8 T=8 K=20000    1995  M. Olson
  1 .. 2 OOO 3 ... 4 XXX 5 ... 6 ... 7 XXX 8 O. 9 .. 11 ... 12 ...

     22.  A 2-way Sun Sparc 20 at SUNY New Paltz.
  N=2 T=2 K=200000   1996  W. Collier
  1 OO 2 OOO 3 ... 4 XXX 5 ... 6 ... 7 XXX 8 OO 9 .. 11 ... 12 ...

Last updated May 1, 1996.