On Martian Magnetic Fields
October 28 2000

1) Problem

Around 30,000 data points are fitted to some 10,000 partial waves using least squares. The fit is sometimes linear (Mars) and sometimes nonlinear, when magnitudes and angles rather than components are measured. It is done by the usual least squares equations. The matrix is 10000 by 10000, and calculating the matrix takes about a factor of 10 more compute time than solving it.

2) Comment on Method

As the matrix calculation dominates, one can afford to investigate more sophisticated solution methods, which better address the inevitable fact that some parameters are ill determined.

The best method (IMHO) is to diagonalize the least squares matrix and keep only the upper range of eigenvalues -- zeroing the shifts from small eigenvalues (say those < 10**-6 of the maximum). You should find more stable results, and that you can freely increase the number of expansion functions and always get numerically reliable answers. This approach always improves convergence of nonlinear fits.

As a good approximation to this, use Marquardt's method, which is to add Lambda * (diagonal matrix) to the least squares matrix -- equivalent to increasing all eigenvalues by Lambda. As the shift is proportional to 1/eigenvalue, this ruse hardly changes the shift from large eigenvalues and greatly reduces the shift from unreliable small ones.

You can find packages implementing either method. For Marquardt one varies Lambda to get the best answer. Start with Lambda equal to some fraction (0.01 or so) of the average eigenvalue (the trace of the least squares matrix divided by the number of rows). Sketches of both ideas are appended at the end of this note.

3) Parallelization

Parallelization can be achieved by parallelizing over either
a) matrix elements, or
b) data points.
For computer scientists a) is a neat algorithm, but for real people b) is best as long as the matrix fits in memory (you need a copy of the least squares matrix on every node). As we have 2 gigabytes per node and the matrix takes 8*(10**8) bytes (10**8 elements at 8 bytes each, divided by 2 for symmetry), I think you can fit it in memory.

So the algorithm is: initialize the matrix/RHS to zero on each node; divide the data points equally between the nodes and calculate each node's contribution to the full matrix/RHS. When this is done (it is embarrassingly parallel), one must add up and redistribute the matrix in the fashion needed by the matrix solver. Then we use the IBM or another solver library -- this is a solved problem. The redistribution time is tiny, I believe. A sketch of this step is also appended below.

4) My advice

Use a single 4-CPU IBM node or the SGI Climate machine, as here shared memory means that one does not have to worry at all about redistributing the matrix. One can use multiple CPUs to get parallelism over the data points and invoke an optimized equation solver. This way you do NOT have to learn MPI, which is needed for multiple-node IBM use.

5) My admission of Failure

I know all this works in principle, but I do not know how to persuade the IBM or SGI compilers to recognize the obvious parallelism in shared memory and avoid possible conflicts if 2 CPUs update the same matrix element simultaneously. I am sure somebody has this expertise (Kyle? IBM? ....); one standard workaround is sketched at the end of this note. Note that in any case the order of the calculations is changed, so the matrix from the parallel algorithm will differ by rounding from that of the sequential algorithm. This will not matter for your science.
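A minimal sketch of the eigenvalue-filtering method of section 2, in C. It assumes the LAPACKE C interface to LAPACK is available; the function name truncated_solve, the cutoff_frac parameter, and the row-major storage are illustrative choices, not prescriptions.

```c
#include <stdlib.h>
#include <lapacke.h>

/* Solve A x = b for the symmetric n x n least squares matrix A
 * (row-major), zeroing the shift contributed by eigenvalues below
 * cutoff_frac * (largest eigenvalue).  A is overwritten by dsyev:
 * afterwards A[i*n+k] is component i of eigenvector k, and the
 * eigenvalues in w[] come back in ascending order. */
int truncated_solve(int n, double *A, const double *b, double *x,
                    double cutoff_frac)
{
    double *w = malloc((size_t)n * sizeof *w);    /* eigenvalues */
    if (w == NULL) return -1;

    int info = LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'V', 'U', n, A, n, w);
    if (info != 0) { free(w); return info; }

    double cutoff = cutoff_frac * w[n - 1];       /* e.g. 1e-6 * max */

    for (int i = 0; i < n; i++) x[i] = 0.0;
    for (int k = 0; k < n; k++) {
        if (w[k] < cutoff) continue;     /* drop ill-determined modes */
        double c = 0.0;                  /* projection of b on mode k */
        for (int i = 0; i < n; i++) c += A[(size_t)i * n + k] * b[i];
        c /= w[k];
        for (int i = 0; i < n; i++) x[i] += c * A[(size_t)i * n + k];
    }
    free(w);
    return 0;
}
```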
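And the Marquardt shift, with the starting Lambda suggested in section 2 (a fraction of the average eigenvalue, i.e. trace/n); one then varies Lambda around this value to get the best answer. The name marquardt_shift and the frac parameter are again illustrative.

```c
#include <stddef.h>

/* Add Lambda down the diagonal of the n x n row-major least squares
 * matrix, which raises every eigenvalue by Lambda.  Starting guess:
 * frac (e.g. 0.01) times the average eigenvalue, trace/n. */
void marquardt_shift(int n, double *A, double frac)
{
    double trace = 0.0;
    for (int i = 0; i < n; i++)
        trace += A[(size_t)i * n + i];
    double lambda = frac * trace / n;
    for (int i = 0; i < n; i++)
        A[(size_t)i * n + i] += lambda;
}
```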
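A sketch of the data-point parallelization of section 3, written against MPI (so this is the multiple-node version that section 4 advises avoiding). accumulate_point is a toy stand-in for the real per-point work, and N is kept small so the sketch actually runs; in the real problem N = 10000 and the local copy is the 8*(10**8) bytes estimated above.

```c
#include <math.h>
#include <mpi.h>
#include <stdlib.h>

#define N 100   /* 10000 in the real problem; small so the sketch runs */

/* Toy stand-in for the real per-point work: adds one data point's
 * rank-one contribution f f^T to the matrix and f*y to the RHS.
 * The real code would evaluate its partial waves here. */
static void accumulate_point(int p, double *A, double *rhs)
{
    double y = sin(0.1 * p);                    /* fake measurement */
    for (int j = 0; j < N; j++) {
        double fj = cos(0.01 * p * (j + 1));    /* fake basis value */
        rhs[j] += fj * y;
        for (int k = 0; k < N; k++)
            A[(size_t)j * N + k] += fj * cos(0.01 * p * (k + 1));
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each node starts with a zeroed copy of the matrix and RHS. */
    double *A   = calloc((size_t)N * N, sizeof *A);
    double *rhs = calloc(N, sizeof *rhs);

    /* Divide the data points equally between nodes; each node adds
     * its contribution to its own copy -- embarrassingly parallel. */
    int npoints = 30000;
    for (int p = rank; p < npoints; p += size)
        accumulate_point(p, A, rhs);

    /* Add up and redistribute: afterwards every node holds the full
     * matrix and RHS, ready to hand to the library solver. */
    MPI_Allreduce(MPI_IN_PLACE, A, N * N, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, rhs, N, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    /* ... call the IBM or other equation solver here ... */

    free(A);
    free(rhs);
    MPI_Finalize();
    return 0;
}
```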
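For section 5's conflict problem, one standard shared-memory workaround, here written with OpenMP directives (which compilers of this class support) rather than relying on automatic parallelization: give each CPU a private copy of the matrix/RHS and merge the copies one thread at a time at the end, so no two CPUs ever update the same element at once. accumulate_point is the same toy stand-in as in the MPI sketch.

```c
#include <math.h>
#include <stdlib.h>

#define N 100   /* 10000 in the real problem */

/* Same toy stand-in as in the MPI sketch. */
static void accumulate_point(int p, double *A, double *rhs)
{
    double y = sin(0.1 * p);
    for (int j = 0; j < N; j++) {
        double fj = cos(0.01 * p * (j + 1));
        rhs[j] += fj * y;
        for (int k = 0; k < N; k++)
            A[(size_t)j * N + k] += fj * cos(0.01 * p * (k + 1));
    }
}

void build_matrix(int npoints, double *A, double *rhs)
{
    #pragma omp parallel
    {
        /* Thread-private accumulators: no write conflicts possible. */
        double *Aloc = calloc((size_t)N * N, sizeof *Aloc);
        double *rloc = calloc(N, sizeof *rloc);

        /* Parallelism over data points, as in section 3. */
        #pragma omp for
        for (int p = 0; p < npoints; p++)
            accumulate_point(p, Aloc, rloc);

        /* Merge the private copies one thread at a time. */
        #pragma omp critical
        {
            for (size_t i = 0; i < (size_t)N * N; i++) A[i] += Aloc[i];
            for (int i = 0; i < N; i++) rhs[i] += rloc[i];
        }
        free(Aloc);
        free(rloc);
    }
}
```

Note the memory trade-off: at N = 10000 each private copy is 0.8 gigabytes, so on a 2 gigabyte node one would store only the symmetric half (or update the shared matrix with omp atomic instead of private copies). Also, the merge order varies from run to run, which is exactly the rounding-level reordering mentioned in section 5 -- harmless for the science.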