On Martian Magnetic Fields
October 28 2000

1) Problem

Around 30,000 data points are fitted to some 10,000 partial waves using least squares. The fit is sometimes linear (Mars) and sometimes nonlinear, when magnitudes and angles rather than components are measured. It is done by the usual least squares equations. The matrix is 10000 by 10000, and calculating the matrix takes about a factor of 10 more compute time than solving it.

2) Comment on Method

As the matrix calculation dominates, one can afford to investigate more sophisticated solution methods, which better address the inevitable fact that some parameters are ill determined.

The best method (IMHO) is to diagonalize the least squares matrix and keep only the upper range of eigenvalues -- zeroing the shifts from small eigenvalues (say those < 10**-6 of the maximum). You should find more stable results, and that you can freely increase the number of expansion functions and always get numerically reliable answers. This approach always improves convergence of nonlinear fits.

As a good approximation to this, use Marquardt's method, which is to add Lambda * (diagonal matrix) to the least squares matrix -- equivalent to increasing all eigenvalues by Lambda. As the shift is proportional to 1/eigenvalue, this ruse hardly changes the shift from large eigenvalues and greatly reduces the shift from unreliable small ones.

You can find packages implementing either method. For Marquardt one varies Lambda to get the best answer. Start with Lambda equal to some fraction (0.01 or so) of the average eigenvalue (the trace of the least squares matrix divided by the number of rows). Sketches of both ideas are appended at the end of this note.

3) Parallelization

Parallelization can be achieved by parallelizing over either
a) matrix elements, or
b) data points.
For computer scientists a) is a neat algorithm, but for real people b) is best as long as the matrix fits in memory (you need a copy of the least squares matrix on every node). As we have 2 gigabytes per node and the matrix takes 8*(10**8) bytes (10**8 elements at 8 bytes each, divided by 2 for symmetry), I think you can fit it in memory.

So the algorithm is: initialize the matrix/RHS to zero on each node; divide the data points equally between the nodes and calculate each node's contribution to the full matrix/RHS. When this is done (it is embarrassingly parallel), one must add up and redistribute the matrix in the fashion needed by the matrix solver. Then we use the IBM or another solver library -- this is a solved problem. The redistribution time is tiny, I believe. A sketch of this step is also appended below.

4) My advice

Use a single 4-CPU IBM node or the SGI Climate machine, as here shared memory means that one does not have to worry at all about redistributing the matrix. One can use multiple CPUs to get parallelism over the data points and invoke an optimized equation solver. This way you do NOT have to learn MPI, which is needed for multiple-node IBM use.

5) My admission of Failure

I know all this works in principle, but I do not know how to persuade the IBM or SGI compilers to recognize the obvious parallelism in shared memory and avoid possible conflicts if 2 CPUs update the same matrix element simultaneously. I am sure somebody has this expertise (Kyle? IBM? ....); one standard workaround is sketched at the end of this note. Note that in any case the order of the calculations is changed, so the matrix from the parallel algorithm will differ by rounding from that of the sequential algorithm. This will not matter for your science.
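A minimal sketch of the eigenvalue-filtering method of section 2, in C. It assumes the LAPACKE C interface to LAPACK is available; the function name truncated_solve, the cutoff_frac parameter, and the row-major storage are illustrative choices, not prescriptions.

```c
#include <stdlib.h>
#include <lapacke.h>

/* Solve A x = b for the symmetric n x n least squares matrix A
 * (row-major), zeroing the shift contributed by eigenvalues below
 * cutoff_frac * (largest eigenvalue).  A is overwritten by dsyev:
 * afterwards A[i*n+k] is component i of eigenvector k, and the
 * eigenvalues in w[] come back in ascending order. */
int truncated_solve(int n, double *A, const double *b, double *x,
                    double cutoff_frac)
{
    double *w = malloc((size_t)n * sizeof *w);    /* eigenvalues */
    if (w == NULL) return -1;

    int info = LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'V', 'U', n, A, n, w);
    if (info != 0) { free(w); return info; }

    double cutoff = cutoff_frac * w[n - 1];       /* e.g. 1e-6 * max */

    for (int i = 0; i < n; i++) x[i] = 0.0;
    for (int k = 0; k < n; k++) {
        if (w[k] < cutoff) continue;     /* drop ill-determined modes */
        double c = 0.0;                  /* projection of b on mode k */
        for (int i = 0; i < n; i++) c += A[(size_t)i * n + k] * b[i];
        c /= w[k];
        for (int i = 0; i < n; i++) x[i] += c * A[(size_t)i * n + k];
    }
    free(w);
    return 0;
}
```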
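And the Marquardt shift, with the starting Lambda suggested in section 2 (a fraction of the average eigenvalue, i.e. trace/n); one then varies Lambda around this value to get the best answer. The name marquardt_shift and the frac parameter are again illustrative.

```c
#include <stddef.h>

/* Add Lambda down the diagonal of the n x n row-major least squares
 * matrix, which raises every eigenvalue by Lambda.  Starting guess:
 * frac (e.g. 0.01) times the average eigenvalue, trace/n. */
void marquardt_shift(int n, double *A, double frac)
{
    double trace = 0.0;
    for (int i = 0; i < n; i++)
        trace += A[(size_t)i * n + i];
    double lambda = frac * trace / n;
    for (int i = 0; i < n; i++)
        A[(size_t)i * n + i] += lambda;
}
```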
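A sketch of the data-point parallelization of section 3, written against MPI (so this is the multiple-node version that section 4 advises avoiding). accumulate_point is a toy stand-in for the real per-point work, and N is kept small so the sketch actually runs; in the real problem N = 10000 and the local copy is the 8*(10**8) bytes estimated above.

```c
#include <math.h>
#include <mpi.h>
#include <stdlib.h>

#define N 100   /* 10000 in the real problem; small so the sketch runs */

/* Toy stand-in for the real per-point work: adds one data point's
 * rank-one contribution f f^T to the matrix and f*y to the RHS.
 * The real code would evaluate its partial waves here. */
static void accumulate_point(int p, double *A, double *rhs)
{
    double y = sin(0.1 * p);                    /* fake measurement */
    for (int j = 0; j < N; j++) {
        double fj = cos(0.01 * p * (j + 1));    /* fake basis value */
        rhs[j] += fj * y;
        for (int k = 0; k < N; k++)
            A[(size_t)j * N + k] += fj * cos(0.01 * p * (k + 1));
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each node starts with a zeroed copy of the matrix and RHS. */
    double *A   = calloc((size_t)N * N, sizeof *A);
    double *rhs = calloc(N, sizeof *rhs);

    /* Divide the data points equally between nodes; each node adds
     * its contribution to its own copy -- embarrassingly parallel. */
    int npoints = 30000;
    for (int p = rank; p < npoints; p += size)
        accumulate_point(p, A, rhs);

    /* Add up and redistribute: afterwards every node holds the full
     * matrix and RHS, ready to hand to the library solver. */
    MPI_Allreduce(MPI_IN_PLACE, A, N * N, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, rhs, N, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    /* ... call the IBM or other equation solver here ... */

    free(A);
    free(rhs);
    MPI_Finalize();
    return 0;
}
```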
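For section 5's conflict problem, one standard shared-memory workaround, here written with OpenMP directives (which compilers of this class support) rather than relying on automatic parallelization: give each CPU a private copy of the matrix/RHS and merge the copies one thread at a time at the end, so no two CPUs ever update the same element at once. accumulate_point is the same toy stand-in as in the MPI sketch.

```c
#include <math.h>
#include <stdlib.h>

#define N 100   /* 10000 in the real problem */

/* Same toy stand-in as in the MPI sketch. */
static void accumulate_point(int p, double *A, double *rhs)
{
    double y = sin(0.1 * p);
    for (int j = 0; j < N; j++) {
        double fj = cos(0.01 * p * (j + 1));
        rhs[j] += fj * y;
        for (int k = 0; k < N; k++)
            A[(size_t)j * N + k] += fj * cos(0.01 * p * (k + 1));
    }
}

void build_matrix(int npoints, double *A, double *rhs)
{
    #pragma omp parallel
    {
        /* Thread-private accumulators: no write conflicts possible. */
        double *Aloc = calloc((size_t)N * N, sizeof *Aloc);
        double *rloc = calloc(N, sizeof *rloc);

        /* Parallelism over data points, as in section 3. */
        #pragma omp for
        for (int p = 0; p < npoints; p++)
            accumulate_point(p, Aloc, rloc);

        /* Merge the private copies one thread at a time. */
        #pragma omp critical
        {
            for (size_t i = 0; i < (size_t)N * N; i++) A[i] += Aloc[i];
            for (int i = 0; i < N; i++) rhs[i] += rloc[i];
        }
        free(Aloc);
        free(rloc);
    }
}
```

Note the memory trade-off: at N = 10000 each private copy is 0.8 gigabytes, so on a 2 gigabyte node one would store only the symmetric half (or update the shared matrix with omp atomic instead of private copies). Also, the merge order varies from run to run, which is exactly the rounding-level reordering mentioned in section 5 -- harmless for the science.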