To parallelize the Wolff algorithm, need to use a parallel depth-first search (``ants-in-the-labyrinth'') algorithm, since that is how the Wolff cluster is grown. Unfortunately, this is a sequential process --- each ant generation has to follow the previous one.
The only parallelism is within a generation, i.e. over the ``wavefront'' of the expanding cluster. The amount of parallelism is very limited, especially if the cluster being grown is small.
A major problem in parallelizing depth-first search methods is that the load balance can be very poor, since the computation is usually confined to a particular (contiguous) section of the data. Can sometimes alleviate this problem by using scattered or cyclic data distribution, rather than the standard block distribution. Howver there is a tradeoff --- the data is no longer local so extra communication is required.