Only two-dimensional transforms have been tested here; however, higher-dimensional FFTs are expected to behave similarly. FFTs on a cluster are usually done in the following way:

1. Do the column transforms. These are local to the nodes, hence no communication between nodes is necessary.
2. Do the 2d matrix transpose. This involves communication between all pairs of nodes.
3. Do the row transforms. These are now local to the nodes, hence again no communication between nodes is necessary.

A second algorithm uses the same scheme, but instead of calling MPI_Alltoall in the matrix-transpose routine, it first defines a new MPI datatype using MPI_Type_vector that describes the block of data of size (Lx/np)*(Ly/np) that must be sent to each of the other processors [Lx*Ly is the system size, np the number of processors]. MPI_Irecv and MPI_Isend are then used in the matrix-transpose routine to exchange the data with all other processors (i.e., np(np-1) MPI_Irecv and MPI_Isend calls are necessary). For the 1d column and row transforms the fftw routine is used. The results for this algorithm are given under the column "fft2d-i".

Both algorithms described above completely separate the CPU-intensive and communication-intensive parts: steps (1) and (3) involve no communication at all, whereas step (2), the matrix transpose, involves only communication. This can be a distinct disadvantage if the computer architecture and MPI distribution allow computation and message passing to run efficiently in parallel *on a single CPU*. With this in mind the following algorithm was implemented:

1. Within a loop over the columns that are local to the processor:

   a) do the 1d FFT,

   b) send/receive the obtained data to/from the other processors.

   Note that now only Ly/np complex data are transferred in each send/receive operation. However, step b) can be implemented using persistent sends and receives.

2. Do the 2d matrix transpose. This does not require any communication, since all data are already in place.

3. Do the row transforms as in step 3. above.

The source code of all three test programs can be downloaded from here.

system size LxL = 1600x1600, nfft=6
===================================

np=1: 42.65

                   np   fftw            fft2d-i         fft2d-p
mpich-1.2.1         2   43.10 (29.19)   43.14 (27.48)   43.14 (26.37)
mpipro-1.5b7-3t     2   36.51 (32.22)   32.40 (28.84)   32.12 (28.81)
lam-6.3.2           2   42.70 (34.90)   35.17 (27.88)   29.01 (24.73)
mpich-1.2.1         4   31.09           31.69           31.90
mpipro-1.5b7-3t     4   27.15           23.93           23.13
lam-6.3.2           4   27.74           25.13           19.64
mpich-1.2.1         8   24.49           23.76           25.21
mpipro-1.5b7-3t     8   17.02           17.86           15.40
lam-6.3.2           8   16.21           16.76           13.34

system size LxL = 800x800, nfft=24
==================================

np=1: 33.11

                   np   fftw            fft2d-i         fft2d-p
mpich-1.2.1         2   41.18 (25.52)   40.04 (24.59)   30.53 (23.47)
mpipro-1.5b7-3t     2   35.14 (29.26)   29.86 (25.41)   33.25 (27.25)
lam-6.3.2           2   40.68 (30.89)   33.12 (24.61)   30.33 (21.72)
mpich-1.2.1         4   31.69           34.17           26.07
mpipro-1.5b7-3t     4   26.08           24.70           25.42
lam-6.3.2           4   25.17           23.24           21.10
mpich-1.2.1         8   23.96           24.72           26.17
mpipro-1.5b7-3t     8   16.38           17.04           19.69
lam-6.3.2           8   15.60           16.07           13.83

system size LxL = 400x400, nfft=96
==================================

np=1: 27.63

                   np   fftw            fft2d-i         fft2d-p
mpich-1.2.1         2   38.38 (22.05)   37.79 (23.26)   36.14 (22.44)
mpipro-1.5b7-3t     2   32.03 (25.28)   28.29 (22.77)   38.50 (28.44)
lam-6.3.2           2   36.76 (28.02)   35.28 (22.33)   35.25 (19.26)
mpich-1.2.1         4   29.60           32.18           33.28
mpipro-1.5b7-3t     4   21.36           24.04           31.78
lam-6.3.2           4   17.83           22.59           21.47
mpich-1.2.1         8   14.35           14.92           38.96
mpipro-1.5b7-3t     8   13.95           15.15           26.00
lam-6.3.2           8   18.06           14.29           16.23

Note that the algorithm "fft2d-p" uses messages of size L/np, which are
*small* in the sense that the transfer time is dominated by
latency. The lam MPI implementation appears to have a much lower latency
than the other two distributions, and accordingly the best results for the
"fft2d-p" algorithm were obtained with lam-6.3.2. For the larger system
sizes this combination (fft2d-p + lam) gave the fastest FFTs. It is
probably also the best algorithm for higher-dimensional FFTs.

It should be emphasized that one cannot draw conclusions from these results about which MPI distribution is "best". The results are valid only for this particular problem and depend sensitively on the algorithm. In fact, they indicate that for sending large messages the ranking may well be different.

Also, the whole picture is likely to change for networks faster than 100BaseT. It is unclear which algorithm is best in such cases, and the whole study would have to be repeated.