Fast Fourier Transforms on a Beowulf - a Case Study

Fast Fourier transforms (FFTs) can be used to test the performance of a Beowulf cluster, because they involve substantial communication between nodes. These tests also allow for a comparison of different MPI distributions. The results given below were obtained on a Beowulf with a switched 100BaseT (fast ethernet) network.

Only two-dimensional transforms have been tested here; however, higher-dimensional FFTs are expected to show similar behavior. FFTs on a cluster are usually done in the following way:

  1. Do the column transforms. These are local to the nodes, hence no communication between nodes is necessary.
  2. Do a 2d matrix transpose. This involves communication between all pairs of nodes.
  3. Do the row transforms. These are now local to the nodes, hence no communication between nodes is necessary.
The FFTW library (www.fftw.org) uses this method and implements step 2 using the MPI_Alltoall routine of the underlying MPI library. The results for this algorithm are listed under the column "fftw". Three MPI libraries were tested: mpich-1.2.1 (www-unix.mcs.anl.gov/mpi/mpich), mpipro-1.5b7-3t (www.mpi-softtech.com/products/mpi_pro_linux), and lam-6.3.2 (www.mpi.nd.edu/lam).
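
For reference, the sketch below shows how such a transform can be driven through the MPI interface of FFTW 2.x (fftw2d_mpi_create_plan / fftwnd_mpi). The function name and overall structure are illustrative only and need not match the downloadable test program.

/* Sketch of the "fftw" variant: the whole 2d transform, including the
 * MPI_Alltoall-based transpose, is delegated to FFTW's MPI routines.   */
#include <stdlib.h>
#include <mpi.h>
#include <fftw_mpi.h>

void run_fftw_mpi(int Lx, int Ly, int nfft)
{
    fftwnd_mpi_plan fplan, bplan;
    fftw_complex *data, *work;
    int local_nx, local_x_start, local_ny_after, local_y_start_after;
    int total_local_size, i;

    fplan = fftw2d_mpi_create_plan(MPI_COMM_WORLD, Lx, Ly, FFTW_FORWARD,  FFTW_ESTIMATE);
    bplan = fftw2d_mpi_create_plan(MPI_COMM_WORLD, Lx, Ly, FFTW_BACKWARD, FFTW_ESTIMATE);

    /* ask the plan how many rows of the Lx x Ly array this process owns */
    fftwnd_mpi_local_sizes(fplan, &local_nx, &local_x_start,
                           &local_ny_after, &local_y_start_after, &total_local_size);

    data = (fftw_complex *) malloc(total_local_size * sizeof(fftw_complex));
    work = (fftw_complex *) malloc(total_local_size * sizeof(fftw_complex));
    /* ... fill data with the local slab of the array ... */

    for (i = 0; i < nfft; i++) {           /* nfft forward and backward transforms */
        fftwnd_mpi(fplan, 1, data, work, FFTW_NORMAL_ORDER);
        fftwnd_mpi(bplan, 1, data, work, FFTW_NORMAL_ORDER);
    }

    free(work); free(data);
    fftwnd_mpi_destroy_plan(bplan);
    fftwnd_mpi_destroy_plan(fplan);
}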

A second algorithm uses the same scheme, but instead of calling MPI_Alltoall in the matrix transpose routine, it first defines a new MPI data type with MPI_Type_vector that describes the block of data of size (Lx/np)*(Ly/np) which must be sent to each of the other processors (Lx*Ly is the system size, np the number of processors). MPI_Irecv and MPI_Isend are then used to exchange these blocks between all processors in the matrix transpose routine (i.e., np(np-1) MPI_Irecv and MPI_Isend calls are necessary). For the 1d column and row transforms the fftw routine is used. The results for this algorithm are given under the column "fft2d-i".
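
A hypothetical sketch of such a transpose is shown below. The function and variable names are ours; the local slab is assumed to be ncols = Lx/np columns of length Ly stored contiguously, fftw_complex is assumed to consist of two doubles (the default double-precision FFTW 2 build), and the final local rearrangement of the received blocks is only indicated by comments.

/* Hypothetical sketch of the fft2d-i transpose.  Each process holds
 * ncols = Lx/np columns of length Ly, stored one after another in "a".
 * The strided (Lx/np) x (Ly/np) block destined for process p is
 * described by an MPI_Type_vector and exchanged with MPI_Irecv/MPI_Isend. */
#include <stdlib.h>
#include <mpi.h>
#include <fftw.h>

void transpose_i(fftw_complex *a, fftw_complex *b, int Lx, int Ly, int np, int me)
{
    int ncols = Lx / np, seg = Ly / np, p, k = 0;
    MPI_Datatype cplx, block;
    MPI_Request *req = (MPI_Request *) malloc(2 * (np - 1) * sizeof(MPI_Request));
    MPI_Status  *st  = (MPI_Status  *) malloc(2 * (np - 1) * sizeof(MPI_Status));

    MPI_Type_contiguous(2, MPI_DOUBLE, &cplx);       /* one complex value          */
    MPI_Type_vector(ncols, seg, Ly, cplx, &block);   /* (Lx/np) x (Ly/np) block    */
    MPI_Type_commit(&cplx);
    MPI_Type_commit(&block);

    for (p = 0; p < np; p++) {
        if (p == me) continue;
        /* receive process p's block contiguously into b ...                       */
        MPI_Irecv(b + p * ncols * seg, ncols * seg, cplx, p, 0,
                  MPI_COMM_WORLD, &req[k++]);
        /* ... and send it the strided block starting at element p*seg of column 0 */
        MPI_Isend(a + p * seg, 1, block, p, 0, MPI_COMM_WORLD, &req[k++]);
    }
    /* ... copy the diagonal (local) block ... */
    MPI_Waitall(k, req, st);
    /* ... transpose the received blocks locally into their final layout ... */

    MPI_Type_free(&block);
    MPI_Type_free(&cplx);
    free(st); free(req);
}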

Both algorithms described above separate the CPU-intensive and the communication-intensive parts completely: parts (1) and (3) involve no communication at all, whereas part (2), the matrix transpose, involves communication only. This can be a distinct disadvantage if the computer architecture and the MPI distribution allow computation and message passing to run efficiently in parallel on a single CPU. With this in mind the following algorithm was implemented:

  1. Within a loop over the columns that are local to the processor:
    a) do the 1d FFT,
    b) send/receive the obtained data to/from the other processors.
    Note that now only Ly/np complex values are transferred in each send/receive operation. However, step b) can be implemented using persistent sends and receives, which reduce the per-message setup overhead (see the sketch below).
  2. Do the 2d matrix transpose. This requires no communication, since all data are already in place.
  3. Do the row transforms as in step 3 of the first algorithm.
The results for this algorithm are listed under "fft2d-p".
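
One way the per-column exchange in step 1.b) could be organized with persistent requests is sketched below. The function and buffer names are ours, fftw_complex is again assumed to be two doubles, plan1d is assumed to be an out-of-place 1d fftw plan of length Ly, and copying data into and out of the work buffers is only indicated.

/* Hypothetical sketch of the fft2d-p loop.  Each local column (length Ly,
 * stored contiguously in "a") is transformed with the 1d fftw routine into
 * the work buffer colt, and the Ly/np segments belonging to the other
 * processes are exchanged with persistent requests set up once, outside
 * the loop.                                                              */
#include <stdlib.h>
#include <mpi.h>
#include <fftw.h>

void fft_columns_persistent(fftw_complex *a, int Lx, int Ly, int np, int me,
                            fftw_plan plan1d)   /* out-of-place 1d plan, length Ly */
{
    int seg = Ly / np, ncols = Lx / np, p, c, k = 0;
    fftw_complex *colt = (fftw_complex *) malloc(Ly * sizeof(fftw_complex));
    fftw_complex *recv = (fftw_complex *) malloc((np - 1) * seg * sizeof(fftw_complex));
    MPI_Request  *req  = (MPI_Request  *) malloc(2 * (np - 1) * sizeof(MPI_Request));
    MPI_Status   *st   = (MPI_Status   *) malloc(2 * (np - 1) * sizeof(MPI_Status));

    /* persistent requests, bound once to fixed pieces of colt and recv */
    for (p = 0; p < np; p++) {
        if (p == me) continue;
        MPI_Recv_init(recv + (k / 2) * seg, 2 * seg, MPI_DOUBLE, p, 0,
                      MPI_COMM_WORLD, &req[k++]);
        MPI_Send_init(colt + p * seg, 2 * seg, MPI_DOUBLE, p, 0,
                      MPI_COMM_WORLD, &req[k++]);
    }

    for (c = 0; c < ncols; c++) {
        fftw_one(plan1d, a + c * Ly, colt);   /* 1d FFT of local column c    */
        MPI_Startall(k, req);                 /* exchange the Ly/np segments */
        MPI_Waitall(k, req, st);
        /* ... store own segment (colt + me*seg) and the received segments ... */
    }

    for (p = 0; p < k; p++)
        MPI_Request_free(&req[p]);
    free(st); free(req); free(recv); free(colt);
}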

The source code of all three test programs can be downloaded from here.

Results

Results are given for three different system sizes, 1600x1600, 800x800, and 400x400. Times (in seconds) are given for nfft forward and backward transforms. The CPU time for a single CPU (without MPI, using the fftwnd routine) is listed as a reference. All results were obtained on a cluster of dual PIII/500 MHz boxes. For np=2, results are given both for the two processes on different nodes and for the two processes on the same node (the latter in brackets). All times were measured using MPI_Wtime; the uncertainty of all results is about ±1 s.
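
The measurement loop presumably looks roughly like the sketch below; fft2d_forward/fft2d_backward stand for whichever of the three variants is being measured (hypothetical names), and taking the maximum over all processes is an assumption about the actual test programs.

#include <mpi.h>

void fft2d_forward(void);    /* provided by the variant under test (hypothetical) */
void fft2d_backward(void);

double time_fft(int nfft)
{
    double t, tmax;
    int i;

    MPI_Barrier(MPI_COMM_WORLD);         /* start all processes together */
    t = MPI_Wtime();
    for (i = 0; i < nfft; i++) {
        fft2d_forward();
        fft2d_backward();
    }
    t = MPI_Wtime() - t;
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    return tmax;                         /* meaningful on rank 0 only */
}
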
system size LxL = 1600x1600, nfft=6
===================================

np=1   42.65

                np  fftw           fft2d-i        fft2d-p
mpich-1.2.1      2  43.10 (29.19)  43.14 (27.48)  43.14 (26.37)
mpipro-1.5b7-3t  2  36.51 (32.22)  32.40 (28.84)  32.12 (28.81)
lam-6.3.2        2  42.70 (34.90)  35.17 (27.88)  29.01 (24.73)

mpich-1.2.1      4  31.09          31.69          31.90
mpipro-1.5b7-3t  4  27.15          23.93          23.13
lam-6.3.2        4  27.74          25.13          19.64

mpich-1.2.1      8  24.49          23.76          25.21
mpipro-1.5b7-3t  8  17.02          17.86          15.40
lam-6.3.2        8  16.21          16.76          13.34


system size LxL = 800x800, nfft=24
==================================

np=1   33.11

                np  fftw           fft2d-i        fft2d-p
mpich-1.2.1      2  41.18 (25.52)  40.04 (24.59)  30.53 (23.47)
mpipro-1.5b7-3t  2  35.14 (29.26)  29.86 (25.41)  33.25 (27.25)
lam-6.3.2        2  40.68 (30.89)  33.12 (24.61)  30.33 (21.72)

mpich-1.2.1      4  31.69          34.17          26.07
mpipro-1.5b7-3t  4  26.08          24.70          25.42
lam-6.3.2        4  25.17          23.24          21.10

mpich-1.2.1      8  23.96          24.72          26.17
mpipro-1.5b7-3t  8  16.38          17.04          19.69
lam-6.3.2        8  15.60          16.07          13.83


system size LxL = 400x400, nfft=96
==================================

np=1   27.63

                np  fftw           fft2d-i        fft2d-p
mpich-1.2.1      2  38.38 (22.05)  37.79 (23.26)  36.14 (22.44)
mpipro-1.5b7-3t  2  32.03 (25.28)  28.29 (22.77)  38.50 (28.44)
lam-6.3.2        2  36.76 (28.02)  35.28 (22.33)  35.25 (19.26)

mpich-1.2.1      4  29.60          32.18          33.28
mpipro-1.5b7-3t  4  21.36          24.04          31.78
lam-6.3.2        4  17.83          22.59          21.47

mpich-1.2.1      8  14.35          14.92          38.96
mpipro-1.5b7-3t  8  13.95          15.15          26.00
lam-6.3.2        8  18.06          14.29          16.23

Discussion

The results above differ significantly for the three tested MPI distributions because of their different latencies and transfer rates. The algorithms "fftw" and "fft2d-i" use messages of (L/np)² complex values. Even for the smallest size L=400 tested here these are large messages, so the communication time is dominated by the transfer rate, not by the latency. This is seen most clearly for the largest system size 1600x1600, where the results for "fftw" and "fft2d-i" are basically the same.
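
To put numbers to this (assuming double-precision complex values of 16 bytes each): for L=400 and np=8 each block already contains (400/8)² = 2500 complex values, i.e. about 40 kB, and for L=1600 and np=2 it grows to (1600/2)² = 640000 values, roughly 10 MB.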

The algorithm "fft2d-p", however, uses messages of only L/np complex values, which are small in the sense that the transfer time is dominated by latency. The lam MPI implementation seems to have a much lower latency than the other two distributions, and therefore the best results for the "fft2d-p" algorithm were obtained with lam-6.3.2. For the larger system sizes this combination (fft2d-p + lam) gave the fastest FFTs. It is probably also the best algorithm for higher-dimensional FFTs.
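
For comparison (again assuming 16 bytes per complex value), the messages of "fft2d-p" range from 1600/2 = 800 values (about 13 kB) down to 400/8 = 50 values (800 bytes), so the per-message overhead of the MPI library plays a much larger role here.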

It should be emphasized that one cannot draw conclusions from these results about which MPI distribution is "best". The results are valid for this particular problem only and depend sensitively on the algorithm. In fact, they indicate that for sending large messages the conclusion may well be different.

The picture is also likely to change for networks faster than 100BaseT. It is unclear which algorithm is best in that case, and the whole study would have to be repeated.