In this set of tests, we run three values of N (the number of integration points) for np = 1, 2, 4, 8, and 16 MPI processes. These were executed on 1 or 4 real computers. We also did a few runs that looped 10 times over the calculation of pi; this gave greater accuracy in timing and allowed one to distinguish MPI communication overhead from the start-up time of LAM, which is about 0.5 seconds. In these runs LAM was only started once; they are denoted by (MPI*10) below.

-----------

Tests were done for N = 10,000; 1,000,000; and 10,000,000 integration points per MPI process; these are termed the Small, Medium, and Large cases below. All times are in seconds.

One Real Computer
-----------------

np          Small   Medium   Large    Med-Sml   Large-Small   Last
1           0.505   0.665     2.056   0.16        1.551
1 (again)   0.5     0.66      2.052   0.16        1.552
2           0.505   0.683     2.146   0.178       1.641       1.0
4           0.513   0.78      3.202   0.267       2.689       0.82
8           0.533   1.055     6.143   0.522       5.61        0.85
16          0.573   1.779    11.283   1.206      10.71        0.82
16 (redo)   0.568   1.674    11.58    1.106      11.012       0.84

There were two runs where the MPI computation was looped over 10 times:

np                 Small   Medium   Large     Med-Sml   Large-Small
4 (MPI*10)         0.551   3.354    28.167    2.803     27.616
4 (MPI*10, redo)   0.548   3.29     28.554    2.742     28.006

When distributing the calculation over multiple MPI processes, we are usually interested in how efficiency behaves as we increase the number of processes. This is captured in the "Last" column, which compares the speed-up of multiple processes to the np=2 case. We treat np=2 as special because the Gridfarm computers have two CPUs each, plus some special hardware that allows multiple threads to run efficiently, so they act as though they have more than 2 CPUs.

The final column is

    Last = ((Large-Small) * 2) / ((Large-Small for np=2) * np)

For example, for np=4: Last = (2.689 * 2) / (1.641 * 4) = 0.82.

Here this is only interesting for np = 4, 8, and 16. A value below 1 says that each CPU runs faster than expected: the time is this factor smaller than the naive linear formula predicts once you have more than 2 processes per computer, i.e. we expect Time proportional to Max(1, processes per computer / 2), since each computer has 2 CPUs.

Four Real Computers
-------------------

np           Small   Medium   Large     Med-Sml   Large-Small   Last
4            0.543   0.698     2.291    0.155       1.748
8            0.548   0.733     2.26     0.185       1.712       1.0
16           0.587   0.881     3.568    0.294       2.981       0.87
4 (MPI*10)   0.555   2.267    17.899    1.712      17.344

We again want to measure efficiency as the number of processes increases. For this case we use 4 computers in our bhost file, so our basis for comparison is the np=8 case, in which there are 2 MPI processes (one per CPU) on each computer:

    Last = ((Large-Small) * 8) / ((Large-Small for np=8) * np)

For example, for np=16: Last = (2.981 * 8) / (1.712 * 16) = 0.87.

Here this is only interesting for np = 16.

Deduce
------

"Small" includes LAM start-up and MPI communication. One can separate these two with the set of runs that go through the pi calculation 10 times inside the main routine.

With NPROC real computers, one runs in scaling fashion (time independent of the number of processes) with up to 2*NPROC processes, that is up to 2 processes per computer, which is reasonable for machines with 2 CPUs.

With more than 2 MPI processes per computer (more than 1 per CPU), runs are from 13% to 18% faster than you would expect (Last = 0.82 to 0.87).

Within the measurement accuracy, performance scales with N after subtracting the "zero N" overhead.

Note that the measurements are better when we increase the measured time by looping over 10 pi calculations after MPI is initialized.

Note that one cannot increase N much more, or rounding error appears because the step size becomes too small.

Looking at the Small-N results versus np shows that the MPI communication is not large (about 0.04 seconds for np=4 on 4 computers versus np=1 on 1 computer).

You can divide Large-Small by 1000, or Medium-Small by 100, to estimate the amount of real floating-point computing included in the Small results. It is about 0.002 seconds out of the 0.5 to 0.6 seconds measured.
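
Sketch of the Timed Pi Program
------------------------------

The benchmark source itself is not reproduced in these notes, so the following is only a minimal sketch of the kind of program the tables time, under our assumptions: midpoint-rule integration of 4/(1+x*x) over [0,1], N points per MPI process, partial sums combined with MPI_Reduce, and an NITER loop (NITER is our name, not from the test code) that repeats the pi calculation 10 times after MPI_Init as in the (MPI*10) runs. Note that MPI_Wtime inside the program cannot see the roughly 0.5 second LAM start-up that the Small column includes; the tabulated times were evidently measured around the whole job.

    /* pi.c -- illustrative sketch only; not the original benchmark source.   */
    #include <stdio.h>
    #include <mpi.h>

    #define NITER 1                  /* set to 10 for the (MPI*10) style runs */

    int main(int argc, char **argv)
    {
        long   n = 10000000;         /* points per process: the "Large" case  */
        int    rank, size;
        double pi = 0.0, t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        t0 = MPI_Wtime();            /* excludes LAM boot time (~0.5 s)       */
        for (int iter = 0; iter < NITER; iter++) {
            double h   = 1.0 / (double)(n * size);  /* step size over [0,1]   */
            double sum = 0.0;
            /* each rank integrates its own block of n midpoints              */
            for (long i = rank * n; i < (rank + 1) * n; i++) {
                double x = h * ((double)i + 0.5);
                sum += 4.0 / (1.0 + x * x);
            }
            double mypi = h * sum;
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("pi = %.12f  time = %.3f s\n", pi, t1 - t0);

        MPI_Finalize();
        return 0;
    }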
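
Checking the "Last" Column
--------------------------

As a cross-check of the arithmetic, the Last values can be reproduced directly from the Large-Small times in the tables. The helper name last_ratio below is ours (illustrative only); it simply evaluates Last = (T_np * np_base) / (T_base * np), with np_base = 2 for the one-computer table and np_base = 8 for the four-computer table.

    #include <stdio.h>

    /* Last = (T_np * np_base) / (T_base * np), T being the Large-Small time. */
    static double last_ratio(double t_np, int np, double t_base, int np_base)
    {
        return (t_np * np_base) / (t_base * np);
    }

    int main(void)
    {
        /* One real computer: baseline np=2 has Large-Small = 1.641 s */
        printf("np=4  : %.2f\n", last_ratio( 2.689,  4, 1.641, 2));   /* 0.82 */
        printf("np=8  : %.2f\n", last_ratio( 5.61,   8, 1.641, 2));   /* 0.85 */
        printf("np=16 : %.2f\n", last_ratio(10.71,  16, 1.641, 2));   /* 0.82 */

        /* Four real computers: baseline np=8 has Large-Small = 1.712 s */
        printf("np=16 : %.2f\n", last_ratio( 2.981, 16, 1.712, 8));   /* 0.87 */
        return 0;
    }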