Benchmark Overview

CUDA Benchmarks

  • Hardware: 10 nodes, each with four NVIDIA Volta V100 GPUs (32 GB HBM2), connected via 56 Gb/s InfiniBand.

  • Software: NVHPC 24.7, CUDA 11.8.

  • Libraries: dtFFT v3.0.0, cuDECOMP v0.6.0, HeFFTe v2.4.1, and 2DECOMP&FFT (latest version from GitHub).

  • Problem Size: A 3D grid of \(1024 \times 1024 \times 1024\) was used for all tests.

  • Methodology: Each benchmark ran 50 iterations, each consisting of a forward and a backward transform. The reported time is the maximum time taken by any GPU over all iterations.

  • Precision: Double precision complex-to-complex (C2C) transforms were performed.

  • Communication Backend: Only MPI-based communication was evaluated for multi-GPU tests.

  • FFT Libraries: All libraries used NVIDIA’s cuFFT for local FFT computations.

  • Additional information: UCX and CUDA IPC are disabled.
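To put the problem size in context, the following minimal sketch estimates how much memory one copy of the grid occupies on each GPU. It assumes an even split of the \(1024 \times 1024 \times 1024\) double-precision complex grid across GPUs and counts a single buffer only; the function name `buffer_gib_per_gpu` is hypothetical, and real libraries allocate additional work buffers that are not modeled here.

```python
# Size of one double-precision complex 1024^3 array, split evenly across GPUs.
# Assumption: a perfectly even split and a single buffer; work buffers are ignored.
BYTES_PER_COMPLEX128 = 16  # two 8-byte doubles (real and imaginary parts)
GRID = (1024, 1024, 1024)
GIB = 1024 ** 3


def buffer_gib_per_gpu(grid, n_gpus, bytes_per_element=BYTES_PER_COMPLEX128):
    """GiB held by each GPU for one copy of the grid under an even split."""
    n_elements = grid[0] * grid[1] * grid[2]
    return n_elements * bytes_per_element / n_gpus / GIB


if __name__ == "__main__":
    for n_gpus in (4, 8, 12, 16, 20, 24, 28, 32, 36):
        size = buffer_gib_per_gpu(GRID, n_gpus)
        print(f"{n_gpus:>2} GPUs: {size:.3f} GiB per buffer per GPU")
```

The full grid is 16 GiB, so with 4 GPUs each device holds 4 GiB per buffer, which fits comfortably in the 32 GB of a V100 even with several buffers in flight.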

Strong Scaling

[Figure: Strong scaling performance]
Strong scaling results

Number of GPUs    dtFFT         cuDECOMP      HeFFTe        2DECOMP&FFT
4                 119626.641    121721.242    121098.586    120167.854
8                 204371.359    204187.656    203894.516    204669.592
12                167539.578    168842.437    168267.093    167885.206
16                137827.844    138721.703    138234.469    138201.588
20                117795.039    119350.664    118823.680    117924.865
24                101455.992    103170.719    102516.188    101612.259
28                89294.328     91382.023     90897.547     89335.099
32                79008.461     81047.414     80815.453     79230.639
36                72392.461     73923.062     73583.375     72661.824
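One way to read the strong scaling results above is to convert timings into speedup and parallel efficiency. The sketch below does this for the dtFFT column. Taking the 8-GPU run as the baseline is an illustrative assumption, not something the benchmark prescribes: with four GPUs per node, the 4-GPU run presumably stays within one node and is therefore not directly comparable to the multi-node runs. The helper names `speedup` and `efficiency` are hypothetical.

```python
# dtFFT timings copied from the strong scaling results (units as reported there).
DTFFT_TIMES = {
    4: 119626.641, 8: 204371.359, 12: 167539.578,
    16: 137827.844, 20: 117795.039, 24: 101455.992,
    28: 89294.328, 32: 79008.461, 36: 72392.461,
}


def speedup(times, base, n):
    """Speedup of the n-GPU run relative to the base-GPU run."""
    return times[base] / times[n]


def efficiency(times, base, n):
    """Parallel efficiency: measured speedup divided by the ideal speedup n / base."""
    return speedup(times, base, n) / (n / base)


if __name__ == "__main__":
    base = 8  # assumed baseline: the smallest multi-node run
    for n in sorted(DTFFT_TIMES):
        if n < base:
            continue
        s = speedup(DTFFT_TIMES, base, n)
        e = efficiency(DTFFT_TIMES, base, n)
        print(f"{n:>2} GPUs: speedup {s:.2f}, efficiency {e:.2f}")
```

With this baseline, going from 8 to 36 GPUs gives a speedup of roughly 2.8 against an ideal of 4.5, that is, a parallel efficiency of a little over 60%.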

Bricks Strong Scaling

Grid decomposition created via MPI_Dims_create.

[Figure: Bricks Strong Scaling]
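For readers unfamiliar with MPI_Dims_create, the following pure-Python sketch approximates what it produces: a factorization of the process count into dimensions that are as close to each other as possible, listed in non-increasing order. This is an illustration only, not the MPI implementation, and actual MPI libraries may break ties differently. The function `balanced_dims` is hypothetical, and the use of three dimensions for the brick decomposition is an assumption.

```python
def balanced_dims(nnodes, ndims):
    """Return ndims factors of nnodes, non-increasing and as close together as possible.

    Approximates the behaviour described for MPI_Dims_create; not the real algorithm.
    """
    best = None  # (spread, dims) of the most balanced factorization found so far

    def search(remaining, slots, upper, prefix):
        nonlocal best
        if slots == 1:
            # The last factor is forced; accept it only if the order stays non-increasing.
            if remaining <= upper:
                candidate = prefix + [remaining]
                spread = candidate[0] - candidate[-1]  # largest minus smallest factor
                if best is None or spread < best[0]:
                    best = (spread, candidate)
            return
        for f in range(min(upper, remaining), 0, -1):
            if remaining % f == 0:
                search(remaining // f, slots - 1, f, prefix + [f])

    search(nnodes, ndims, nnodes, [])
    return best[1]


if __name__ == "__main__":
    for n_gpus in (4, 8, 12, 16, 20, 24, 28, 32, 36):
        print(n_gpus, balanced_dims(n_gpus, 3))
```

Under these assumptions, 36 GPUs would be arranged as a 4 x 3 x 3 grid and 8 GPUs as 2 x 2 x 2, while counts with a large prime factor, such as 28, give more elongated grids (7 x 2 x 2).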

Host Benchmarks

Weak Scaling

C2C Double Precision; initial grid is \(128 \times 128 \times 128\); no FFT is executed

[Figure: Bricks Strong Scaling]

Strong Scaling

C2C Double Precision; \(64 \times 2048 \times 2048\); FFTW3 executor

[Figures: dtFFT Strong Scaling; Competitors Strong Scaling]

C2C Double Precision; \(2048 \times 2048 \times 64\); FFTW3 executor

[Figures: dtFFT Strong Scaling narrow; Competitors Strong Scaling narrow]

Bricks Strong Scaling

C2C Double Precision; \(1024 \times 1024 \times 1024\); FFTW3 executor. Grid created via MPI_Dims_create.

[Figure: Competitors Strong Scaling narrow]