SA can be easily parallelized on a coarse-grained machine using independent parallelism, that is, by starting each processor from a different random initial condition and giving it its own random number stream. Each processor will find an independent local minimum, and we choose the lowest of these as our best solution. Alternatively, we may want to pass information between processors during the annealing, for example, to replicate ``good'' configurations.
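The independent-parallelism scheme can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than part of the original text: the one-dimensional test function `energy`, the geometric cooling schedule, the Gaussian move, and the names `anneal` and `best_of_independent_runs`. Each run owns a private generator seeded differently, which stands in for the separate random number streams; a production code would use a generator designed to supply provably independent parallel streams.

```python
import math
import random


def energy(x):
    # Hypothetical test function with many local minima (illustration only).
    return x * x + 10.0 * math.sin(3.0 * x)


def anneal(seed, steps=5000, t_start=10.0, t_end=1e-3):
    """One independent SA run; returns (best_energy, best_x)."""
    rng = random.Random(seed)        # private random stream for this run
    x = rng.uniform(-10.0, 10.0)     # random initial condition
    e = energy(x)
    best_e, best_x = e, x
    for k in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (k / (steps - 1))
        x_new = x + rng.gauss(0.0, 1.0)
        e_new = energy(x_new)
        # Metropolis acceptance test.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / t):
            x, e = x_new, e_new
            if e < best_e:
                best_e, best_x = e, x
    return best_e, best_x


def best_of_independent_runs(seeds, map_fn=map):
    """Run one annealing per seed and keep the lowest minimum found.

    map_fn defaults to the serial built-in map; passing the map method of
    a process pool distributes the runs over processors.
    """
    return min(map_fn(anneal, seeds))
```

To spread the runs over processors, one option is to pass `concurrent.futures.ProcessPoolExecutor().map` as `map_fn`, called from under an `if __name__ == "__main__":` guard. Because the runs never communicate, the result is the same as in the serial case.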
For large problems, where we may just want a good approximate solution to be generated quickly, we can instead parallelize the problem itself. For example, we could use domain decomposition for the Ising spin glass, or map a TSP tour onto a ring of processors (see the books by Fox et al.).
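One common way to make domain decomposition safe for a nearest-neighbor spin glass is a checkerboard (two-color) update order. The sketch below is a serial Python illustration under stated assumptions, not the method of the cited books: a periodic square lattice of even size, random $\pm 1$ couplings, energy $E = -\sum_{\langle ij \rangle} J_{ij} s_i s_j$, and hypothetical function names throughout. Sites of one color share no bond, so within a color phase every site could be updated concurrently, with each processor owning a block of the lattice and exchanging boundary spins between phases.

```python
import math
import random


def make_couplings(n, rng):
    # jr[i][j] couples (i, j) to its right neighbor, jd[i][j] to the one below.
    jr = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]
    jd = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]
    return jr, jd


def local_field(s, jr, jd, i, j):
    # Sum of J * spin over the four neighbors (periodic boundaries).
    n = len(s)
    return (jr[i][j] * s[i][(j + 1) % n]
            + jr[i][(j - 1) % n] * s[i][(j - 1) % n]
            + jd[i][j] * s[(i + 1) % n][j]
            + jd[(i - 1) % n][j] * s[(i - 1) % n][j])


def total_energy(s, jr, jd):
    # E = -sum over bonds of J * s_i * s_j, each bond counted once.
    n = len(s)
    return -sum(s[i][j] * (jr[i][j] * s[i][(j + 1) % n]
                           + jd[i][j] * s[(i + 1) % n][j])
                for i in range(n) for j in range(n))


def sweep_color(s, jr, jd, color, temp, rng):
    # Metropolis update of every site with (i + j) % 2 == color. These sites
    # interact only with the other color, so the order does not matter.
    n = len(s)
    for i in range(n):
        for j in range(n):
            if (i + j) % 2 != color:
                continue
            d_e = 2 * s[i][j] * local_field(s, jr, jd, i, j)
            if d_e <= 0 or rng.random() < math.exp(-d_e / temp):
                s[i][j] = -s[i][j]


def anneal_checkerboard(n=8, sweeps=200, t_start=3.0, t_end=0.1, seed=0):
    """Anneal an n x n spin glass (n even); returns (e_start, e_end, spins)."""
    rng = random.Random(seed)
    jr, jd = make_couplings(n, rng)
    s = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]
    e_start = total_energy(s, jr, jd)
    for k in range(sweeps):
        temp = t_start * (t_end / t_start) ** (k / (sweeps - 1))
        for color in (0, 1):   # a parallel code would synchronize here
            sweep_color(s, jr, jd, color, temp, rng)
    return e_start, total_energy(s, jr, jd), s
```

In a distributed version, the loop over colors is where neighboring processors would exchange the spins on their block boundaries before the next phase begins.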