Run the program for LCG (a=13445, c=0, m=65536, seed=16811) with sequence sizes of 10, 100, 1000, and 10000. Plot the errors in the average and variance versus sample size. Discuss your results. (Many people like gnuplot, we also have xmgr which has a tutorial in the CSEP book.)
I wrote this program using C++ so I could compile and run it on my Power Mac 8100/110 using the Metroworks Gold compiler (thus eliminating the compiler problems I have had so far).
My solution is fairly straightforward, so I won't explain much if it. I wrote a fucnction to call the random number generator for a given number of times and output the average and variance. The main routine simply calls this function for each assigned number of repititions. I ran the program using several methods for calculating the variance, and finally settled on a formula given by William H. Beyer on page 473 of the 29th edition of CRC Standard Mathematical Tables and Formulae. The results from the different methods I tried were similar enough that simply kept the last method tried (avoiding reprogramming!)
The output from this program was actually better than I expected. The average values were all very close to 0.5 as one would expect. Initially I expected the variances to be higher, but after analyzing what the variance really means they are about where I would expect them. The standard deviation is the square root of the variance. The variances for the higher order repetitions were very close to .08 yielding a standard deviation of about .28. If the distribution were gaussian one would expect about 68% of the values to fall within one standard deviation from the mean--within the range of .22 to .78 or a total width of .56. If the distribution were flat we would expect about 56% of the values to lie within this range. Statistically one would expect 94% of the values to lie within a distance of two standard deviations or a total width of 1.12. We know, however that all the values were within a range of 1 (0 to 1). From this I conclude that the distribution I achieved with the random number generator was smaller than gaussian in the middle and greater than gaussian on the extrema. This is exactly what one would expect from a flat distribution.
Here is a graphic display of the errors in the averages and the variations which I generated using IGOR.
Here is my solution and here is the output from my solution. I found that for any seed the cycle was from one to four numbers long, so I printed the first four numbers from each series. That there were no series longer than four was suprising to me, at first, but There are a total of 8 independent series, from seeds of 0,1,2,3,4,6,8, and 12. There is a difference between the even and odd seeds. All of the odd seeds produce a series with a period of 4. The even seeds which are multiples of 4 produce series with a period of 1 while the other even seeds produce series of period 2. I have written separate pages to explain each of these facts:
Why seeds that are multiples of four have a period of 1, Why even seeds have a period of 2, Why any seed has a period of 4. In these explanations I have used the notation R(i) i=0, 1, 2, ... to be the ith term in the series of random numbers from a specified seed.
LCG ( a=66, c=0, m=2**31, seed=0) for 4,096 random numbers.
LCG ( a=5, c=0, m=2**31-1, seed=0) for 4,096 random numbers.
Discuss why these are poor (or not) generators.
Click here to view my modified code. If you wish to sift through the 8k data points click here to see the output from my code. I also made the graphs, available as jpegs the first part here and the second part here.
It is not obvious from the first graph that the first generator was awful, but upon viewing the data the awfulness of the generator becomes apparent. No matter what seed I used (I chose several, the first ones were rather large prime numbers, but after a while I just chose really larg odd numbers) the generator settled to zero after fifty terms or so. Once it hit zero, that's all that followed, so althoug the plot doesn't look so bad, if you consider that there are 4000+ dots at the origin, it is clear that the first generator is awful. I thought at first that the zeroing out of the series was due to the fact that my C++ compiler uses 4 byte integers, but I checked out the binary arithmetic and the lower 4 bytes work out fine, the upper 4 bytes (of a 64 bit result) are ignored. Thus the compiler is automatically doing the modulus for results of more than 32 bits. I tried the generator with different numbers for a and the results were fine, so I have to conclude that 5 is a special case.
The lines that appear in the second graph are very predictable. They all have a slope of 5, which by no coincidence, is the value of a. This graph illustrates why it is essential not only to have a large value for a, but a large value for c also, as the addition and modulus functions would have broken up this pattern better if c had been non-zero, and the larger c had been, the more the pattern would have been disturbed.
Obviously there is a high degree of corellation in both of these random number generators, and they should not be used!