In this set of tests, we run three values of N (the number of integration points) for np = 1, 2, 4, 8, and 16 MPI processes. These were executed on 1 or 4 real computers. We also did a few runs that looped 10 times over the calculation of pi; this gave greater accuracy in timing and allowed one to distinguish MPI communication overhead from the start-up time of LAM, which is about 0.5 seconds. In these runs LAM was only started once; they are denoted by (MPI*10) below.

-----------

Tests were done for N = 10,000; 1,000,000; and 10,000,000 integration points per MPI process; these are termed the Small, Medium, and Large cases below. All times are in seconds.

One Real Computer
-----------------

np          Small   Medium   Large    Med-Sml   Large-Small   Last
1           0.505   0.665     2.056   0.16        1.551
1 (again)   0.5     0.66      2.052   0.16        1.552
2           0.505   0.683     2.146   0.178       1.641       1.0
4           0.513   0.78      3.202   0.267       2.689       0.82
8           0.533   1.055     6.143   0.522       5.61        0.85
16          0.573   1.779    11.283   1.206      10.71        0.82
16 (redo)   0.568   1.674    11.58    1.106      11.012       0.84

There were two runs where the MPI computation was looped over 10 times:

np                 Small   Medium   Large     Med-Sml   Large-Small
4 (MPI*10)         0.551   3.354    28.167    2.803     27.616
4 (MPI*10, redo)   0.548   3.29     28.554    2.742     28.006

When distributing the calculation over multiple MPI processes, we are usually interested in how efficiency behaves as we increase the number of processes. This is captured in the "Last" column, which compares the speed-up of multiple processes to the np=2 case. We treat np=2 as special because the Gridfarm computers have two CPUs each, plus some special hardware that allows multiple threads to run efficiently, so they act as though they have more than 2 CPUs.

The final column is

    Last = ((Large-Small) * 2) / ((Large-Small for np=2) * np)

For example, for np=4: Last = (2.689 * 2) / (1.641 * 4) = 0.82.

Here this is only interesting for np = 4, 8, and 16. A value below 1 says that each CPU runs faster than expected: the time is this factor smaller than the naive linear formula predicts once you have more than 2 processes per computer, i.e. we expect Time proportional to Max(1, processes per computer / 2), since each computer has 2 CPUs.

Four Real Computers
-------------------

np           Small   Medium   Large     Med-Sml   Large-Small   Last
4            0.543   0.698     2.291    0.155       1.748
8            0.548   0.733     2.26     0.185       1.712       1.0
16           0.587   0.881     3.568    0.294       2.981       0.87
4 (MPI*10)   0.555   2.267    17.899    1.712      17.344

We again want to measure efficiency as the number of processes increases. For this case we use 4 computers in our bhost file, so our basis for comparison is the np=8 case, in which there are 2 MPI processes (one per CPU) on each computer:

    Last = ((Large-Small) * 8) / ((Large-Small for np=8) * np)

For example, for np=16: Last = (2.981 * 8) / (1.712 * 16) = 0.87.

Here this is only interesting for np = 16.

Deduce
------

"Small" includes LAM start-up and MPI communication. One can separate these two with the set of runs that go through the pi calculation 10 times inside the main routine.

With NPROC real computers, one runs in scaling fashion (time independent of the number of processes) with up to 2*NPROC processes, that is up to 2 processes per computer, which is reasonable for machines with 2 CPUs.

With more than 2 MPI processes per computer (more than 1 per CPU), runs are from 13% to 18% faster than you would expect (Last = 0.82 to 0.87).

Within the measurement accuracy, performance scales with N after subtracting the "zero N" overhead.

Note that the measurements are better when we increase the measured time by looping over 10 pi calculations after MPI is initialized.

Note that one cannot increase N much more, or rounding error appears because the step size becomes too small.

Looking at the Small-N results versus np shows that the MPI communication is not large (about 0.04 seconds for np=4 on 4 computers versus np=1 on 1 computer).

You can divide Large-Small by 1000, or Medium-Small by 100, to estimate the amount of real floating-point computing included in the Small results. It is about 0.002 seconds out of the 0.5 to 0.6 seconds measured.
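
Sketch of the Timed Pi Program
------------------------------

The benchmark source itself is not reproduced in these notes, so the following is only a minimal sketch of the kind of program the tables time, under our assumptions: midpoint-rule integration of 4/(1+x*x) over [0,1], N points per MPI process, partial sums combined with MPI_Reduce, and an NITER loop (NITER is our name, not from the test code) that repeats the pi calculation 10 times after MPI_Init as in the (MPI*10) runs. Note that MPI_Wtime inside the program cannot see the roughly 0.5 second LAM start-up that the Small column includes; the tabulated times were evidently measured around the whole job.

    /* pi.c -- illustrative sketch only; not the original benchmark source.   */
    #include <stdio.h>
    #include <mpi.h>

    #define NITER 1                  /* set to 10 for the (MPI*10) style runs */

    int main(int argc, char **argv)
    {
        long   n = 10000000;         /* points per process: the "Large" case  */
        int    rank, size;
        double pi = 0.0, t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        t0 = MPI_Wtime();            /* excludes LAM boot time (~0.5 s)       */
        for (int iter = 0; iter < NITER; iter++) {
            double h   = 1.0 / (double)(n * size);  /* step size over [0,1]   */
            double sum = 0.0;
            /* each rank integrates its own block of n midpoints              */
            for (long i = rank * n; i < (rank + 1) * n; i++) {
                double x = h * ((double)i + 0.5);
                sum += 4.0 / (1.0 + x * x);
            }
            double mypi = h * sum;
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("pi = %.12f  time = %.3f s\n", pi, t1 - t0);

        MPI_Finalize();
        return 0;
    }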
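
Checking the "Last" Column
--------------------------

As a cross-check of the arithmetic, the Last values can be reproduced directly from the Large-Small times in the tables. The helper name last_ratio below is ours (illustrative only); it simply evaluates Last = (T_np * np_base) / (T_base * np), with np_base = 2 for the one-computer table and np_base = 8 for the four-computer table.

    #include <stdio.h>

    /* Last = (T_np * np_base) / (T_base * np), T being the Large-Small time. */
    static double last_ratio(double t_np, int np, double t_base, int np_base)
    {
        return (t_np * np_base) / (t_base * np);
    }

    int main(void)
    {
        /* One real computer: baseline np=2 has Large-Small = 1.641 s */
        printf("np=4  : %.2f\n", last_ratio( 2.689,  4, 1.641, 2));   /* 0.82 */
        printf("np=8  : %.2f\n", last_ratio( 5.61,   8, 1.641, 2));   /* 0.85 */
        printf("np=16 : %.2f\n", last_ratio(10.71,  16, 1.641, 2));   /* 0.82 */

        /* Four real computers: baseline np=8 has Large-Small = 1.712 s */
        printf("np=16 : %.2f\n", last_ratio( 2.981, 16, 1.712, 8));   /* 0.87 */
        return 0;
    }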