ARCHTEST provides a concrete and objective measure of a machine's logical behavior in accessing shared data. ARCHTEST is unique. No other program now provides the information provided by ARCHTEST.
Uses for ARCHTEST
Manufacturers of shared memory multiprocessors need ARCHTEST:
As multiprocessing moves to the desktop more multi-threaded code will be written. To avoid time-consuming calls to system functions, applications programmers will seek to do their own locking of data. Here, also, it will be vital to understand the architectures of the machines on which the code will run.
Rules tested for by ARCHTEST
The computation rules (CMP) describe how the terminal value of
each operand is calculated from the initial values of the operands.
The rule of uniprocessor (UPO) order describes the ordering constraints that a uniprocessor must obey. Each processor on a multiprocessor must obey the UPO rule.
The order rules describe the constraints that a processor on a multiprocessor system may be required to obey:
Tests Performed by ARCHTEST
ARCHTEST consists of twelve distinct tests, T1 through T12.
Denote by A(X,Y,Z) the architecture consisting of the rules X, Y, and
Z. Each test in ARCHTEST seeks to detect that a machine relaxes a
particular architecture. The architecture for which each test seeks
to detect a relaxation is:
If '**' = '00', then the test operates only on integer operands. If '**' = '10', then the test operates on both integer operands and floating point operands. If '**' = '20', then the test operates only on floating point operands. Tests T1, T8, and T9 operate only on a single operand and so for them there is no '10' case.
Tests T1-T7 were first described in Appendix D of [COL92].
Other Capabilities
For each interaction on shared data ARCHTEST calculates a
numerical value representing how close the machine came to relaxing a
rule. These values for each test are accumulated and presented in a
histogram. The histograms describe the behavior of a machine. Some
machines never relax a given rule; some do so often and to an extreme
degree; others do so very infrequently (thereby suggesting that an
unintended relaxation has become visible). The most extreme values
seen so far show an instruction occurring around a couple of hundred
instructions out of order, or a processor seeing a value of an operand
which has been stale for about the same length of time.
Data can be generated either via machine simulation or via assembler language routines and then fed into ARCHTEST to obtain the full capabilites of ARCHTEST in analysing the data. This allows for data to be created in more idiosyncratic circumstances than is possible with a C language program (such as ARCHTEST). This feature also allows one to ensure that the behaviors observed by ARCHTEST truly originate in the hardware and are not an artifact of the C compiler.
The basic tests in ARCHTEST operate on only one, two, or four operands. It is conceivable that a machine could operate correctly on such a light load, but fail on a heavier load. Tests logically involving more operands can take an extremely long time to analyze. A simple compromise is to run a basic test accessing the essential operands, while constantly accessing some number of irrelevant shared operands. This increases the cache traffic, while allowing the analysis to remain brief. Two forms of irrelevant traffic can be generated: (1) irrelevant accesses can be made to lines other than the ones containing the essential operands, and/or irrelevant accesses can be to the same lines as the lines containing the essential operands.
Some of the data produced by ARCHTEST yields performance information. For example, ARCHTEST prints out the time to perform 1,000,000 load instructions, add instructions, subtract instructions, multiply instructions, and divide instructions (both integer and floating operations). ARCHTEST also records the time to perform each test and the time to analyze each test. In one case machine B was known to be twice as fast overall as machine A, yet the tests showed machine A to be faster (15% in some tests; 80% in others) on shared data operations than machine B.
Some other behaviors made visible by ARCHTEST are neither tests of the logic nor measurements of performance. The output from Test T6 can be plotted to show the interaction of six processors (two writing, four reading) on almost an instruction- by-instruction level. The output from Test T8 shows that on some machine instructions, the same sequences of values (called convoys) appear to each processor in turn. No one has yet been willing to judge such behavior as either good or bad.
If there is a need to analyze a particular behavior in greater detail than that performed by ARCHTEST, it is possible to dump the arrays created by ARCHTEST for later study.
ARCHTEST performs each logical test (1) on integer operands only, (2) on floating operands only, and (3) where possible, on both integer and floating operands.
Many parameters can be set and modified at run time.
Results of Testing
Twenty-four machines have had some or all of the tests run on
them. For many machines no formal statement of their architecture has
been given. Of those machines for which an architecture was known,
only the Hitachi machine was more relaxed than was allowed by its
architecture.
The amount of information associated with the following results varies widely. In some cases I personally made the test runs and so am confident of their validity. In many cases I relied on others to make the runs; their responses varied from an email message ("Passed all tests.") to a copy of the program output, but often with limited identification of the hardware system. These results should be regarded as tentative, until full, formal testing can be done.
The machines tested have fallen into one of three more or less clearly defined groups. Some machines were sequentially consistent (SC). This corresponds to an architecture of A(CMP,UPO,PO,CC1). Others obeyed A(CMP,UPO,RR,WW,RW,CC1), that is, they relaxed sequential consistency only in allowing reads to occur before logically preceding writes. Call this the POK architecture, since the IBM mainframes designed in Poughkeepsie were the first to follow this standard. The third group of machines are called processor consistent (PCon) [GOO89]. They relax both the WR and the CC1 rules. The architecture is A(CMP,UPO,RR,WW,RW,CC3). (More details on the testing can be found in the Appendix.)
The SC machines were:
A recent study [A&G96] suggests that even more relaxed behavior will be visible on the Alpha, the PowerPC, the T3D, and some other machines.
Example. Test T4: Seek a relaxation of A(CMP,PO).
Sequential consistency requires that all read and write
operations to shared data occur in the order defined by the program.
The most common relaxation of PO is to allow read operations to occur
in time before logically preceding write operations occur. Test T4
seeks to detect a relaxation of WR and of RW.
Suppose a 2-way multiprocessor executes the following code and calculates the given terminal values. The way to show that a read occurred before a logically preceding write (or even worse, that the machine relaxed the rule of computation) is to assume that the machine obeyed the rules of computation and of program order and then to deduce a contradiction.
Initially, (A,B,U,V) = (0,0,0,0). P1 P2 A = 1; B = 1; U = B; V = A; Terminally, (A,B,U,V) = (1,1,0,0).If the machine obeys program order, then in P1 the write into A occurs before the read from B.
If the machine obeys the rule of computation, then since U is zero, the read from B in P1 occurs before the write of a one value into B in P2.
If the machine obeys program order, then in P2 the write into B occurs before the read from A.
If the machine obeys the rule of computation, then since V is zero, the read from A in P2 occurs before the write of a one value into A in P1.
Therefore, a circuit in time exists among the read and write operations for the program, which is a contradiction. Thus the machine fails to obey the rule of computation and/or the rule of program order.
For such a brief program to be successful as a test case, the two processors have to be exactly synchronized. Extending the test case in the manner shown below creates a test program that can be used on a real machine.
P1 P2 A = 0; B = 0; U[0] = B; V[0] = A; A = 1; B = 1; U[1] = B; V[1] = A; A = 2; B = 2; U[2] = B; V[2] = A; A = 3; B = 3; U[3] = B; V[3] = A; . . . . . . A = k; B = k; U[k] = B; V[k] = A;If k = 200,000, then there is a much longer window during which a relaxation of the rule of program order can be detected. Note that the processors do not have to be in sync; either can be interrupted away for a long period. So long as both execute at the same time, the possibility exists for them to reveal a relaxation of the rule of program order.
A True Experience with Minisculy Distinct Architectures
An operating systems programmer for company A discovered a bug
when the system ran on a machine from company B. The bug never
occurred when the system ran on the supposedly compatible machine
produced by company A. The programmer concluded that the B machine
had a bug which was not present on the A machine.
The truth was difficult to get to. Eventually, both machines were found to obey the architecture defined for the machines. However, the A machine, as implemented, was stronger than the architecture required. The B machine implemented the architecture more accurately. The OS had not been coded to perform correctly on a machine obeying a strict implementation of the architecture, and so the fault was found to be in the OS, not in the B machine.
Multiprocessor Diagnostics
Multiprocessor Diagnostics was formed by William W. Collier in
1993 to develop and to market ARCHTEST. The development of ARCHTEST
was supported by the Mid-Hudson InfoMall section of the Northeast
Parallel Architecture Center at Syracuse University (Dennis Eaton,
Director). It is a pleasure to acknowledge this support.
ARCHTEST is available to academics under a research license which allows long-term, noncommercial use of ARCHTEST at no fee. Researchers are expected to undertake extensive use of ARCHTEST and to make the results of such use available to others, either in journals or on the web.
Commercial use of ARCHTEST requires either a site license or an entity license. A site license is $5,000 (a one-time charge); it allows use of ARCHTEST by one company in any one county in the U.S.A. (or the equivalent overseas). An entity license is $25,000 (a one-time charge). It allows use of ARCHTEST by a company anywhere in the world. Purchase orders can be faxed to 914-896-7675.
ARCHTEST is about 10,000 lines of C code. A single flag allows ARCHTEST to compile to run on Windows NT, Solaris, or UnixWare 2.0. For other systems users may have to supply code to start and to stop multithreading.
For further information see http://www.infomall.org or contact William W. Collier at Multiprocessor Diagnostics, 13 Gary Place, Wappingers Falls, New York 12590. Tel: 914-297-5901. E-mail: collier@acm.org.
Appendix
The results of testing 24 machines are summarized below. Some
number of testers who provided data were not willing to be identified.
1. The Ultra machine at New York University. N=4 T=4 K=20000 1991 M. Ayala 1 O. 2 O.. 3 O.. 4 O.. 5 O.. 6 O.. 7 O.. 8 .. 9 .. 11 ... 12 ... 2. A Solborne 3-way Sparc at the Compaq Corp. N=3 T=3 K=? 1993 D. Koenen 1 .. 2 O.. 3 ... 4 O.. 5 O.. 6 ... 7 O.. 8 .. 9 .. 11 ... 12 ... 3. An SGI Onyx at the University of Minnesota. N=4 T=4 K=20000 1994 F. Mounes-Toussi, D. Lilja, Univ. of Minnesota 1 .. 2 OO. 3 OO. 4 OO. 5 OO. 6 OO. 7 OO. 8 O. 9 .. 11 ... 12 ... 4. A 4-way Sun Sparc 630 at SUNY New Paltz. N=4 T=4 K=200000 1996 W. Collier 1 OO 2 OOO 3 OOO 4 OOO 5 OOO 6 OOO 7 OOO 8 OO 9 OO 11 OOO 12 OOO 5. A KSR-1 machine at the University of Toronto. N=8 T=8 K=500000 1996 N. Manjikian, University of Toronto 1 OO 2 OOO 3 OOO 4 OOO 5 OOO 6 OOO 7 OOO 8 OO 9 .. 11 ... 12 ... The program ran out of time before completing the tests. 6. Three IBM mainframes. N=4 T=4 K=200000 1992 W. Collier 1 O. 2 OO. 3 OO. 4 XX. 5 OO. 6 OO. 7 OO. 8 .. 9 .. 11 ... 12 ... 7. An Amdahl mainframe. N=4 T=4 K=200000 1992 W. Collier 1 O. 2 OO. 3 OO. 4 XX. 5 OO. 6 OO. 7 OO. 8 .. 9 .. 11 ... 12 ... 8. An Intergraph 2-way TD4 N=2 T=2 K=20000 1995 Anon 1 .. 2 OO. 3 ... 4 XO. 5 ... 6 ... 7 OO. 8 O. 9 .. 11 ... 12 ... 9. An ALR Revolution 2-way. N=2 T=2 K=500000 1995 Anon 1 .. 2 OO. 3 ... 4 XX. 5 ... 6 ... 7 OO. 8 O. 9 .. 11 ... 12 ... 10. An AST Manhattan V Series 5090. N=2 T=2 K=20000 1995 Anon 1 .. 2 OO. 3 ... 4 XO. 5 ... 6 ... 7 OO. 8 O. 9 .. 11 ... 12 ... 11. A Hewlett-Packard Vectra XU 5/90. N=2 T=2 K=20000 1995 Anon 1 .. 2 OO. 3 ... 4 OO. 5 ... 6 ... 7 OO. 8 O. 9 .. 11 ... 12 ... N=2 T=2 K=250000 1995 Anon 1 .. 2 OO. 3 ... 4 XX. 5 ... 6 ... 7 OO. 8 O. 9 .. 11 ... 12 ... The histograms of the test results for this machine show behavior which is only rarely and just barely over the edge in relaxed territory. 12. XBit Computer, ASUS motherboard N=2 T=2 K=200000 1995 Anon 1 OO 2 ... 3 ... 4 XX. 5 ... 6 ... 7 OO. 8 O. 9 .. 11 ... 12 ... 13. An Hitachi mainframe. N=4 T=4 K=200000 1992 W. Collier 1 O. 2 OO. 3 OO. 4 XX. 5 OO. 6 OO. 7 XX. 8 .. 9 .. 11 ... 12 ... The architecture for this machine is POK; it should not have failed Test T7. 14. An SGI Power Challenge. N=2 T=2 K=500000 1995 F. Mounes-Toussi, D. Lilja, Univ. of Minnesota 1 .. 2 OO. 3 ... 4 OX. 5 ... 6 ... 7 OX. 8 O. 9 .. 11 ... 12 ... This machine is SC except when operating on both integer and floating operands. 15. An ALR Revolution 4/100. N=4 T=4 K=400000 1995 Anon 1 .. 2 OO. 3 OO. 4 XX. 5 OO. 6 OO. 7 XX. 8 O. 9 .. 11 ... 12 ... 16. A Compaq Proliant 3-way N=3 T=3 K=500000 1995 Anon 1 .. 2 OO. 3 OO. 4 XX. 5 OO. 6 OO. 7 XX. 8 O. 9 .. 11 ... 12 ... 17. A Compaq 4000 5/66. N=R T=R K=20000 1995 Anon 1 .. 2 OO. 3 OO. 4 XX. 5 OO. 6 OO. 7 XX. 8 O. 9 .. 11 ... 12 ... 18. An NCR 2-way. N=2 T=2 K=500000 1995 Anon 1 .. 2 OO. 3 ... 4 XX. 5 ... 6 ... 7 XX. 8 O. 9 .. 11 ... 12 ... 19. An Olivetti 3-way. N=3 T=3 K=500000 1995 Anon 1 .. 2 OO. 3 ... 4 XX. 5 OO. 6 ... 7 XX. 8 O. 9 .. 11 ... 12 ... 20. A Sequent 6-way N=6 T=6 K=250000 1995 Anon 1 .. 2 OO. 3 OO2. 4 XX. 5 OO. 6 OO. 7 XX. 8 O. 9 .. 11 ... 12 ... 21. A Tricord server. N=8 T=8 K=20000 1995 M. Olson 1 .. 2 OOO 3 ... 4 XXX 5 ... 6 ... 7 XXX 8 O. 9 .. 11 ... 12 ... 22. A 2-way Sun Sparc 20 at SUNY New Paltz. N=2 T=2 K=200000 1996 W. Collier 1 OO 2 OOO 3 ... 4 XXX 5 ... 6 ... 7 XXX 8 OO 9 .. 11 ... 12 ...