Benchmark Overview

CUDA Benchmarks

Hardware: 10 nodes, each with four NVIDIA Volta V100 GPUs (32 GB HBM2), connected via 56 Gb/s InfiniBand.
Software: NVHPC 24.7, CUDA 11.8.
Libraries: dtFFT v3.0.0, cuDECOMP v0.6.0, HeFFTe v2.4.1, and 2DECOMP&FFT (latest version from GitHub).
Problem Size: A 3D grid of \(1024 \times 1024 \times 1024\) was used for all tests.
Methodology: Each benchmark was run for 50 iterations performing both forward and backward transforms. The reported time is the maximum time taken by any GPU over all iterations.
Precision: Double precision complex-to-complex (C2C) transforms were performed.
Communication Backend: Only MPI-based communication was evaluated for multi-GPU tests.
FFT Libraries: All libraries utilized NVIDIA’s cuFFT for local FFT computations.
Additional information: UCX and CUDA IPC are disabled.
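
The reduction described under Methodology can be sketched as follows. This is only one reading of "maximum time taken by any GPU over all iterations": each GPU's time is summed over its iterations, and the largest total is reported. The function name `reported_time` and the sample timings are hypothetical and are not taken from the benchmark code.

```python
def reported_time(per_gpu_iteration_times):
    """Return the slowest GPU's total time over all iterations.

    per_gpu_iteration_times[g][i] is the time GPU g spent in iteration i
    (one forward plus one backward transform).
    """
    totals = [sum(times) for times in per_gpu_iteration_times]
    return max(totals)


# Hypothetical timings for 3 GPUs and 4 iterations (not measured data).
sample = [
    [1.0, 1.1, 1.0, 1.2],
    [1.3, 1.2, 1.4, 1.1],
    [0.9, 1.0, 1.0, 1.0],
]
slowest_total = reported_time(sample)
print(slowest_total)
```

In an MPI program the same reduction would typically be a maximum taken across ranks; the list of lists here simply stands in for the per-rank measurements.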

Strong scaling results
Number of GPUs	dtFFT	cuDECOMP	HeFFTe	2DECOMP&FFT
4	119626.641	121721.242	121098.586	120167.854
8	204371.359	204187.656	203894.516	204669.592
12	167539.578	168842.437	168267.093	167885.206
16	137827.844	138721.703	138234.469	138201.588
20	117795.039	119350.664	118823.680	117924.865
24	101455.992	103170.719	102516.188	101612.259
28	89294.328	91382.023	90897.547	89335.099
32	79008.461	81047.414	80815.453	79230.639
36	72392.461	73923.062	73583.375	72661.824
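
As a reading aid, speedup and parallel efficiency can be derived from the table. The sketch below uses the dtFFT column only and takes the 8-GPU row as the baseline, since with four GPUs per node it is the first configuration that must span more than one node. That baseline and the helper name `scaling` are choices made for this illustration and are not part of the benchmark. Ratios are unitless, so the table's time unit does not matter here.

```python
# dtFFT column of the strong scaling table (same units as the table).
dtfft_times = {
    8: 204371.359,
    12: 167539.578,
    16: 137827.844,
    20: 117795.039,
    24: 101455.992,
    28: 89294.328,
    32: 79008.461,
    36: 72392.461,
}


def scaling(times, baseline_gpus):
    """Map GPU count -> (speedup, efficiency) relative to the baseline row."""
    base = times[baseline_gpus]
    result = {}
    for gpus, t in sorted(times.items()):
        speedup = base / t                # how much faster than the baseline
        ideal = gpus / baseline_gpus      # speedup under perfect scaling
        result[gpus] = (speedup, speedup / ideal)
    return result


dtfft_scaling = scaling(dtfft_times, baseline_gpus=8)
for gpus, (speedup, efficiency) in dtfft_scaling.items():
    print(f"{gpus:>2} GPUs: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

On these numbers, going from 8 to 36 GPUs gives dtFFT a speedup of roughly 2.8, against an ideal of 4.5. The same function can be applied to the other columns for comparison.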

Grid decomposition created via MPI_Dims_create.
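
For intuition, MPI_Dims_create is specified to choose dimensions that are as close to each other as possible, in non-increasing order. The sketch below mimics that behaviour for a two-dimensional process grid only. Whether these benchmarks use a 2D grid is an assumption, and `balanced_2d_dims` is a hypothetical helper, not an MPI or dtFFT function.

```python
import math


def balanced_2d_dims(nprocs):
    """Return (p, q) with p * q == nprocs, p >= q, and p - q minimal."""
    q = math.isqrt(nprocs)
    # Walk down from floor(sqrt(nprocs)) to the nearest divisor.
    while nprocs % q != 0:
        q -= 1
    return nprocs // q, q


# GPU counts from the strong scaling table.
for n in (4, 8, 12, 16, 20, 24, 28, 32, 36):
    print(n, balanced_2d_dims(n))
```

A real call to MPI_Dims_create can also take more than two dimensions and honour entries fixed by the caller, which this sketch does not cover.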

C2C Double Precision; initial grid is \(128 \times 128 \times 128\); no FFT is executed

C2C Double Precision; \(64 \times 2048 \times 2048\); FFTW3 executor

C2C Double Precision; \(2048 \times 2048 \times 64\); FFTW3 executor

C2C Double Precision; \(1024 \times 1024 \times 1024\); FFTW3 executor. Grid created via MPI_Dims_create.