Usage Guide¶
This guide provides a comprehensive overview of using the dtFFT library to perform parallel data transpositions and optionally
Fast Fourier Transforms (FFTs) across host and GPU environments.
Designed for high-performance computing, dtFFT simplifies the process of decomposing multidimensional data, managing memory,
and executing transformations by integrating with external FFT libraries or operating in transpose-only mode.
Whether targeting CPU clusters with MPI or GPU-accelerated systems with CUDA, this library offers flexible configuration options to
optimize performance for specific use cases. The following sections detail key aspects of working with dtFFT, from plan creation to
execution and resource management, with practical examples in Fortran, C, and C++.
Error Handling and Macros¶
Almost all dtFFT functions return error codes to indicate whether execution was successful. These codes help users identify and handle issues during plan creation, memory allocation, execution, and finalization. The error handling mechanism differs slightly across language APIs:
Fortran API: Functions include an optional
error_codeparameter (typeinteger(int32)), always positioned as the last argument. If omitted, errors must be checked through other means, such as program termination or runtime assertions.C API: Functions return a value of type
dtfft_error_t, allowing direct inspection of the result.C++ API: Functions return
dtfft::Error, typically used with exception handling or explicit checks.
To simplify error checking, dtFFT provides predefined macros that wrap function calls and handle error codes automatically:
Fortran: The
DTFFT_CHECKmacro, defined indtfft.f03, checks theerror_codeand halts execution with an informative message if an error occurs. Include this header with#include "dtfft.f03"to use it.C: The
DTFFT_CALLmacro wraps function calls, checks the returneddtfft_error_t, and triggers an appropriate response (printing an error message and exiting) if the call fails.C++: The
DTFFT_CXX_CALLmacro similarly wraps calls, throws C++ exception and displays an error message.
Below is an example demonstrating error handling with these macros:
#include "dtfft.f03"
...
call plan%execute(a, b, DTFFT_EXECUTE_FORWARD, error_code=error_code)
DTFFT_CHECK(error_code) ! Halts if error_code != DTFFT_SUCCESS
...
#include <dtfft.h>
...
DTFFT_CALL( dtfft_execute(plan, a, b, DTFFT_EXECUTE_FORWARD, NULL) )
...
#include <dtfft.hpp>
...
DTFFT_CXX_CALL( plan.execute(a, b, dtfft::Execute::FORWARD, nullptr) );
Error codes are defined in the API sections (e.g., DTFFT_SUCCESS, DTFFT_ERROR_INVALID_TRANSPOSE_TYPE). Refer to the Fortran, C, and C++ API documentation for a complete list and detailed descriptions.
Plan Creation¶
There are two ways to create plan in dtFFT.
First one involves specifying global dimensions of the data and discussed in this section along with other options such as the MPI communicator, precision, and FFT executor type.
Second one involves specifing local portion of the data on each MPI process and discussed in the Plan Creation Using Pencil Decomposition section below.
There are three plan types supported by the library:
Real-to-Real (R2R)
Complex-to-Complex (C2C)
Real-to-Complex (R2C)
Note
The Real-to-Complex plan is not available in the API if the library was not compiled with any FFT support.
Each is tailored to specific transformation needs.
Plans are created using the create method or corresponding constructor, as detailed in the Fortran, C, and C++ API sections.
For every plan type, an MPI communicator can be specified to define the process distribution (see Grid Decomposition below).
The optimization level applied during plan creation can be controlled via the effort parameter (see Selecting plan effort below). Additional parameters include:
Precision: Controlled by
dtfft_precision_twith the following options:DTFFT_SINGLE: Single precisionDTFFT_DOUBLE: Double precision
FFT Executor: Specified via
dtfft_executor_tto determine the external FFT library or transpose-only mode, with the following options:DTFFT_EXECUTOR_NONE:Transpose-only(no FFT)DTFFT_EXECUTOR_FFTW3: FFTW3 (host only, available if compiled with FFTW3 support)DTFFT_EXECUTOR_MKL: MKL DFTI (host only, available if compiled with MKL support)DTFFT_EXECUTOR_CUFFT: cuFFT (GPU only, available if compiled with CUDA support)DTFFT_EXECUTOR_VKFFT: VkFFT (GPU only, available if compiled with VkFFT support)
Additional optional settings can be specified before plan creation using dtfft_config_t (see Setting Additional Configurations below),
allowing users to customize behavior such as Z-slab optimization or GPU backend selection.
The following example creates a 3D C2C double-precision transpose-only plan:
#include "dtfft.f03"
! dtfft.f03 contains macro DTFFT_CHECK
use iso_fortran_env
use dtfft
use mpi ! or use mpi_f08
type(dtfft_plan_c2c_t) :: plan
integer(int32) :: dims(3)
integer(int32) :: error_code
type(dtfft_effort_t) :: effort = DTFFT_PATIENT
type(dtfft_precision_t) :: precision = DTFFT_DOUBLE
type(dtfft_executor_t) :: executor = DTFFT_EXECUTOR_NONE
call MPI_Init()
! Set dimensions
dims = [32, 32, 32]
! Creating plan with create method
call plan%create(dims, MPI_COMM_WORLD, precision, effort, executor, error_code)
DTFFT_CHECK(error_code)
#include <dtfft.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
dtfft_plan_t plan;
int32_t dims[3] = {32, 32, 32};
MPI_Init(&argc, &argv);
// Creating plan
DTFFT_CALL( dtfft_create_plan_c2c(3, dims, MPI_COMM_WORLD, DTFFT_DOUBLE, DTFFT_PATIENT, DTFFT_EXECUTOR_NONE, &plan) );
return 0;
}
#include <dtfft.hpp>
#include <mpi.h>
#include <vector>
int main(int argc, char *argv[]) {
MPI_Init(&argc, &argv);
const std::vector<int32_t> dims = {32, 32, 32};
dtfft::Precision precision = dtfft::Precision::DOUBLE;
dtfft::Effort effort = dtfft::Effort::PATIENT;
dtfft::Executor executor = dtfft::Executor::NONE;
// Creating plan with constructor
dtfft::PlanC2C plan(dims, MPI_COMM_WORLD, precision, effort, executor);
// OR use generic interface
// dtfft::PlanC2C plan(dims.size(), dims.data(), MPI_COMM_WORLD, precision, effort, executor);
// OR use Plan pointer
// dtfft::Plan *plan = new dtfft::PlanC2C(dims, MPI_COMM_WORLD, precision, effort, executor);
return 0;
}
Grid Decomposition¶
dtFFT decomposes multidimensional data into a grid to distribute it across MPI processes for parallel execution.
The decomposition strategy depends on the global dimensions NX × NY × NZ (in Fortran order), the number of MPI
processes P, and the provided communicator.
Default Behavior¶
- When the communicator passed during plan creation is
MPI_COMM_WORLDwithPprocesses,dtFFTattempts the following steps in order: If
P <= NZ(andNZ / P >= 16for the GPU version), split the grid asNX × NY × NZ / P. This distributes the Z-dimension acrossPprocesses. Division need not be even, and the local size per process may vary slightly.If the Z-split fails (e.g.,
P > NZorNZ / P < 16on GPU), attemptNX × NY / P × NZ. This distributes the Y-dimension acrossPprocesses, providedNX <= Pto ensure compatibility with future transpositions (e.g., X-to-Y).If both attempts fail,
dtFFTconstructs a 3D communicator by fixing the X-dimension split to 1 and usingMPI_Dims_create(P, 2, dims)to balance the remainingPprocesses across Y and Z, resulting inNX × NY / P1 × NZ / P2(whereP1 × P2 = P).If this 3D decomposition is not viable (e.g.,
NY < P1orNZ < P2),dtFFTwill continue, but warning message will be printed. Make sure that DTFFT_ENABLE_LOG is set in order to see it.
User-Controlled Decomposition¶
- Users can specify a custom MPI communicator with grid topology attached. Its grid dimensions must be defined in Fortran order (X, Y, Z):
1D Communicator: A one-dimensional communicator with
Pprocesses splits the grid asNX × NY × NZ / P, distributing the Z-dimension acrossPprocesses.2D Communicator: A two-dimensional communicator with topology
P1 × P2(whereP1 * P2 = P) decomposes the grid asNX × NY / P1 × NZ / P2, splitting Y byP1and Z byP2while keeping X indivisible.3D Communicator: A three-dimensional communicator with topology
P0 × P1 × P2(whereP0 * P1 * P2 = P) is supported, butP0(X-dimension split) must be 1 to preserve the fastest-varying dimension. This results inNX × NY / P1 × NZ / P2. Violating this condition triggersDTFFT_ERROR_INVALID_COMM_FAST_DIM.
Z-Slab Optimization¶
When the grid is decomposed as NX × NY × NZ / P (e.g., via a 1D communicator or the first default step), the Z-slab optimization
becomes available. If enabled, it reduces the number of network data transfers by employing a two-dimensional FFT algorithm during
calls to the execute() method. This also enables the use of DTFFT_TRANSPOSE_X_TO_Z and DTFFT_TRANSPOSE_Z_TO_X in
the transpose() method, while all other transpose types (e.g., DTFFT_TRANSPOSE_X_TO_Y, DTFFT_TRANSPOSE_Y_TO_Z)
remain available to the user.
This optimization can be disabled by passing the appropriate parameter in dtfft_config_t (see configuration details below) or by providing
DTFFT_ENABLE_Z_SLAB environment variable, but it cannot be forcibly enabled by passing an MPI_COMM_WORLD
communicator if conditions for its applicability are not met.
The resulting local data extents for each process can be retrieved using get_local_sizes() or get_pencil(),
providing the necessary information for memory allocation and interfacing with external FFT libraries. The starting indices
(“starts”) of each process’s local data portion are determined based on its coordinates within the MPI grid topology.
Selecting plan effort¶
The effort parameter in dtFFT determines the level of optimization applied during plan creation,
influencing how data transposition is configured. On the host, dtFFT leverages custom MPI datatypes to perform transpositions,
tailored to the grid decomposition and data layout. On the GPU, transposition is handled by nvRTC-compiled kernels, optimized at runtime
for specific data sizes and types, with data exchange between GPUs facilitated by various backend options (e.g., NCCL, MPI P2P).
The supported effort levels defined by dtfft_effort_t control the extent of this optimization as follows:
DTFFT_ESTIMATE¶
This minimal-effort option prioritizes fast plan creation.
On the host, dtFFT selects a default grid decomposition (see Grid Decomposition above) and constructs MPI datatypes based
on environment variables such as DTFFT_DTYPE_X_Y and DTFFT_DTYPE_Y_Z (see MPI Datatype Selection variables),
which define the default send and receive strategies.
On the GPU, it uses a pre-selected backend specified via dtfft_config_t (see configuration details below), compiling an nvRTC
kernel tailored to the chosen backend.
DTFFT_MEASURE¶
With this moderate-effort setting, dtFFT explores multiple grid decomposition strategies to reduce communication overhead
during transposition, cycling through possible grid layouts to find an efficient configuration. On the host, it uses the same MPI datatypes
as defined by environment variables in DTFFT_ESTIMATE. On the GPU, it employs the same backend as specified in the configuration for DTFFT_ESTIMATE.
If a Cartesian communicator is provided, it reverts to DTFFT_ESTIMATE behavior, relying on the user-specified topology.
DTFFT_PATIENT¶
This maximum-effort option extends DTFFT_MEASURE by exhaustively optimizing transposition strategies. On the host, it cycles
through various custom MPI datatype combinations (e.g., contiguous send with sparse receive, sparse send with contiguous receive) to
minimize network latency and maximize throughput. On the GPU, it cycles through available GPU backends (e.g., NCCL, MPI P2P) to select
the fastest available backend.
The choice of effort impacts both plan creation time and runtime performance.
Higher effort levels (DTFFT_MEASURE and DTFFT_PATIENT) increase setup time but can enhance transposition efficiency,
especially for large datasets or complex grids.
If a user already knows the optimal grid decomposition, MPI datatypes, or GPU backend from a previous computation,
these can be pre-specified before plan creation: the grid via a custom MPI_Comm communicator, MPI datatypes through environment
variables (e.g., DTFFT_DTYPE_X_Y), and the GPU backend through dtfft_config_t.
Setting Additional Configurations¶
The dtfft_config_t type allows users to set additional configuration parameters for dtFFT before plan creation,
tailoring its behavior to specific needs. These settings are optional and can be applied using the constructor dtfft_config_t()
or the dtfft_create_config() function, followed by a call to dtfft_set_config().
Configurations must be set prior to creating a plan to take effect. The available parameters are:
Z-Slab Optimization (
enable_z_slab)A logical flag determining whether
dtFFTuses Z-slab optimization (see Grid Decomposition). When enabled (default:.true.), it reduces network data transfers in plans decomposed asNX × NY × NZ / Pby employing a two-dimensional FFT algorithm. Disabling it (.false.) may resolveDTFFT_ERROR_VKFFT_R2R_2D_PLANor improve performance if the underlying 2D FFT implementation is suboptimal. In most cases, Z-slab is faster due to fewer transpositions.
Execution platform (
platform)A
dtfft_platform_tvalue specifying the platform for executingdtFFTplans. By default, set toDTFFT_PLATFORM_HOST, meaning execution occurs on the host (CPU). Users can set it toDTFFT_PLATFORM_CUDAfor GPU execution, provided the build supports CUDA (DTFFT_WITH_CUDAdefined).Available only in CUDA-enabled builds.
CUDA Stream (
stream)A
dtfft_stream_tvalue specifying the main CUDA stream for GPU operations. By default,dtFFTmanages its own stream, retrievable viaget_stream(). Users can set a custom stream, taking responsibility for its destruction after the plan is destroyed withdestroy().Available only in CUDA-enabled builds.
GPU Backend (
backend)A
dtfft_backend_tvalue selecting the GPU backend for transposition wheneffortisDTFFT_ESTIMATEorDTFFT_MEASURE(see Selecting plan effort). The default isDTFFT_BACKEND_NCCLif NCCL is available in the library build; otherwise,DTFFT_BACKEND_MPI_P2P. Supported options include:DTFFT_BACKEND_MPI_DATATYPE: Backend using MPI datatypes.DTFFT_BACKEND_MPI_P2P: MPI peer-to-peer backend.DTFFT_BACKEND_MPI_A2A: MPI backend usingMPI_Alltoallv.DTFFT_BACKEND_MPI_P2P_PIPELINED: Pipelined MPI peer-to-peer backend.DTFFT_BACKEND_NCCL: NCCL backend.DTFFT_BACKEND_NCCL_PIPELINED: Pipelined NCCL backend.DTFFT_BACKEND_CUFFTMP: cuFFTMp backend.
Available only in CUDA-enabled builds.
MPI Backends (
enable_mpi_backends)A logical flag controlling whether MPI-based GPU backends (e.g., MPI P2P) are tested during autotuning with
DTFFT_PATIENTeffort (default:.false.). Disabled by default due to an OpenMPI bug (https://github.com/open-mpi/ompi/issues/12849) causing GPU memory leaks during autotuning (e.g., 8 GB leak for a 1024×1024×512 C2C plan with Z-slab on a single GPU, or 24 GB per GPU on four GPUs without Z-slab).Workarounds include disabling MPI backends or using
--mca btl_smcuda_use_cuda_ipc 0withmpiexec, though the latter reduces performance.Available only in CUDA-enabled builds.
Pipelined Backends (
enable_pipelined_backends)A logical flag enabling pipelined GPU backends (e.g., overlapping data copy and unpack) during
DTFFT_PATIENTautotuning (default:.true.). These require an additional internal buffer managed bydtFFT.Available only in CUDA-enabled builds.
NCCL Backends (
enable_nccl_backends)A logical flag enabling NCCL backends during
DTFFT_PATIENTautotuning (default:.true.).Available only in CUDA-enabled builds.
NVSHMEM Backends (
enable_nvshmem_backends)A logical flag enabling
NVSHMEM-enabled backends support duringDTFFT_PATIENTautotuning (default:.true.).Available only in CUDA-enabled builds.
These settings allow fine-tuning of transposition strategies and GPU behavior.
For example, disabling enable_mpi_backends mitigates memory leaks, while setting a custom stream integrates dtFFT
with existing CUDA workflows. Refer to the Fortran, C and C++ API pages for detailed parameter specifications.
Note
Almost all values can be overridden by setting the appropriate environment variable, which takes precedence if set. Refer to Environment Variables section.
Following example creates config object, disables Z-slab, enables MPI Backends and sets custom stream:
use cudafor
use dtfft
integer(cuda_stream_kind) :: my_stream
type(dtfft_config_t) :: config
integer :: ierr
! Create config with default values
config = dtfft_config_t()
! Disable Z-slab optimization
config%enable_z_slab = .false.
! Enable MPI backends for autotuning
config%enable_mpi_backends = .true.
! Create and set custom CUDA stream
ierr = cudaStreamCreate(my_stream)
config%stream = dtfft_stream_t(my_stream)
! Apply configuration
call dtfft_set_config(config)
! Now we can create a plan
#include <cuda_runtime.h>
#include <dtfft.h>
cudaStream_t my_stream;
dtfft_config_t config;
// Create config with default values
dtfft_create_config(&config);
// Disable Z-slab optimization
config.enable_z_slab = 0;
// Enable MPI backends for autotuning
config.enable_mpi_backends = 1;
// Create and set custom CUDA stream
cudaStreamCreate(&my_stream);
config.stream = (dtfft_stream_t)my_stream;
// Apply configuration
dtfft_set_config(config);
// Now we can create a plan
#include <cuda_runtime.h>
#include <dtfft.hpp>
cudaStream_t my_stream;
dtfft::Config config; // Automatically fills with default values
// Disable Z-slab optimization
config.set_enable_z_slab(false);
// Enable MPI backends for autotuning
config.set_enable_mpi_backends(true);
// Create and set custom CUDA stream
cudaStreamCreate(&my_stream);
config.set_stream((dtfft_stream_t)my_stream);
// Apply configuration
dtfft::set_config(config);
// Now we can create a plan
Plan Creation Using Pencil Decomposition¶
In addition to the standard plan creation using global dimensions, dtFFT supports creating plans using the dtfft_pencil_t structure.
This advanced feature allows users to define custom grid decompositions and provides greater control over data distribution across MPI processes.
When to Use Pencil-Based Plan Creation¶
Pencil-based plan creation is useful when:
You need custom grid decomposition that differs from
dtFFT’s default strategyYou want to integrate with existing domain decomposition from other libraries
You need to ensure specific data locality requirements
You want to reuse decomposition information from previous computations
Creating Plans with Pencil Structure¶
All plan types support pencil-based constructors. Both Fortran and C++ overload the existing plan constructors to accept a pencil structure.
In the C API, a separate function is provided to create plans using a pencil structure. The pencil structure must be defined before plan creation, specifying the local data portion for each MPI process.
Fortran users should call dtfft_pencil_t constructor and provide two arrays: starts and counts which must provide the local data portion for each MPI process.
The following example demonstrates how to create a C2C plan using a pencil structure for a 64x64x64 grid, splitting only in the Z (slowest) dimension across MPI processes:
#include "dtfft.f03"
use iso_fortran_env
use dtfft
use mpi
type(dtfft_plan_c2c_t) :: plan
type(dtfft_pencil_t) :: my_pencil
integer(int32) :: error_code
integer(int32) :: starts(3), counts(3)
integer(int32) :: rank, size
call MPI_Init()
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)
! Define custom decomposition for 64x64x64 grid
! Example: split only in Z dimension
starts = [0, 0, rank * (64 / size)]
counts = [64, 64, 64 / size]
! Create pencil structure
pencil = dtfft_pencil_t(starts, counts)
! Create C2C plan using pencil
call plan%create(pencil, MPI_COMM_WORLD, DTFFT_DOUBLE, DTFFT_ESTIMATE, DTFFT_EXECUTOR_NONE, error_code)
DTFFT_CHECK(error_code)
#include <dtfft.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
dtfft_plan_t plan;
dtfft_pencil_t pencil;
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
// Define custom decomposition for 64x64x64 grid
pencil.ndims = 3;
pencil.starts[0] = 0;
pencil.starts[1] = 0;
pencil.starts[2] = rank * (64 / size);
pencil.counts[0] = 64;
pencil.counts[1] = 64;
pencil.counts[2] = 64 / size;
// Create C2C plan using pencil
DTFFT_CALL( dtfft_create_plan_c2c_pencil(&pencil, MPI_COMM_WORLD,
DTFFT_DOUBLE, DTFFT_ESTIMATE, DTFFT_EXECUTOR_NONE, &plan) );
return 0;
}
#include <dtfft.hpp>
#include <mpi.h>
#include <vector>
int main(int argc, char *argv[]) {
MPI_Init(&argc, &argv);
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
// Define custom decomposition for 64x64x64 grid
std::vector<int32_t> starts = {0, 0, rank * (64 / size)};
std::vector<int32_t> counts = {64, 64, 64 / size};
// Create pencil
auto pencil = dtfft::Pencil(starts, counts);
// Create C2C plan using pencil
dtfft::PlanC2C plan(pencil, MPI_COMM_WORLD, dtfft::Precision::DOUBLE,
dtfft::Effort::ESTIMATE, dtfft::Executor::NONE);
return 0;
}
Memory Allocation¶
After a plan is created, users may need to determine the memory required to execute it.
The plan method get_local_sizes() retrieves the number of elements in “real” and “Fourier” spaces and the
minimum number of elements that must be allocated:
in_starts: Start indices of the local data portion in real space (0-based)
in_counts: Number of elements in the local data portion in real space
out_starts: Start indices of the local data portion in Fourier space (0-based)
out_counts: Number of elements in the local data portion in Fourier space
alloc_size: Minimum number of elements needed for
in,out, orauxbuffers
Arrays in_starts, in_counts, out_starts, and out_counts must have at least as many elements as the plan’s dimensions.
The minimum number of bytes required for each buffer is alloc_size * element_size.
The element_size can be obtained by get_element_size() which returns:
C2C:
2 * sizeof(double)(double precision) or2 * sizeof(float)(single precision)R2R and R2C:
sizeof(double)(double precision) orsizeof(float)(single precision)
integer(int64) :: alloc_size, element_size
! Get number of elements
call plan%get_local_sizes(alloc_size=alloc_size)
! OR use convenient wrapper
! alloc_size = plan%get_alloc_size()
! Optionally get element size in bytes
element_size = plan%get_element_size()
size_t alloc_size;
// Get number of elements
dtfft_get_local_sizes(plan, NULL, NULL, NULL, NULL, &alloc_size);
// OR use convenient wrapper
// dtfft_get_alloc_size(plan, &alloc_size);
// Optionally get element size in bytes
size_t element_size;
dtfft_get_element_size(plan, &element_size);
size_t alloc_size;
// Get number of elements
DTFFT_CXX_CALL( plan.get_local_sizes(nullptr, nullptr, nullptr, nullptr, &alloc_size) );
// OR use wrapper
// DTFFT_CXX_CALL( plan.get_alloc_size(&alloc_size) );
// Optionally get element size in bytes
size_t element_size;
DTFFT_CXX_CALL( plan.get_element_size(&element_size) );
For 3D plans, get_local_sizes() does not detail the intermediate Y-direction layout.
This information, useful for transpose-only plans or when using unsupported FFT libraries, can be retrieved via the pencil
interface (see Pencil Decomposition below). Pencil IDs start from 1 in both C and Fortran.
The dtFFT library provides functions to allocate and free memory tailored to the plan:
mem_alloc(): Allocates memory.mem_free(): Frees memory allocated bymem_alloc().
Fortran interface provides additional methods for memory allocation and deallocation:
mem_alloc_ptr(): Allocates memory and returns a pointer of typec_ptr.mem_free_ptr(): Frees memory allocated bymem_alloc_ptr().
Host Version¶
Allocates memory based on the FFT library:
fftw_mallocfor FFTW3mkl_mallocfor MKL DFTaligned_alloc(16-byte alignment) from C11 Standard library for transpose-only plans.
GPU Version¶
Allocates memory based on the dtfft_backend_t:
ncclMemAllocfor NCCL (if available)nvshmem_mallocfor NVSHMEM-based backendscudaMallocotherwise.
Future versions may support HIP-based allocations.
If NCCL is used and supports buffer registration via ncclCommRegister, and the environment variable
DTFFT_NCCL_BUFFER_REGISTER is not set to 0, the allocated buffer will also be registered.
This registration optimizes communication performance by reducing the overhead of memory operations,
which is particularly beneficial for workloads with repeated communication patterns.
use iso_fortran_env
! Host version
complex(real64), pointer :: a(:), b(:), aux(:)
! CUDA Fortran version
complex(real64), device, contiguous, pointer :: a(:), b(:), aux(:)
! Allocates memory
call plan%mem_alloc(alloc_size, a, error_code=error_code); DTFFT_CHECK(error_code)
call plan%mem_alloc(alloc_size, b, error_code=error_code); DTFFT_CHECK(error_code)
call plan%mem_alloc(alloc_size, aux, error_code=error_code); DTFFT_CHECK(error_code)
! or use pointers of type c_ptr
use iso_c_binding
type(c_ptr) :: a_ptr, b_ptr, aux_ptr
integer(int64) :: alloc_bytes
alloc_bytes = alloc_size * element_size
call plan%mem_alloc_ptr(alloc_bytes, a_ptr, error_code=error_code); DTFFT_CHECK(error_code)
call plan%mem_alloc_ptr(alloc_bytes, b_ptr, error_code=error_code); DTFFT_CHECK(error_code)
call plan%mem_alloc_ptr(alloc_bytes, aux_ptr, error_code=error_code); DTFFT_CHECK(error_code)
size_t alloc_bytes = alloc_size * element_size;
double *a, *b, *aux;
DTFFT_CALL( dtfft_mem_alloc(plan, alloc_bytes, (void**)&a) );
DTFFT_CALL( dtfft_mem_alloc(plan, alloc_bytes, (void**)&b) );
DTFFT_CALL( dtfft_mem_alloc(plan, alloc_bytes, (void**)&aux) );
#include <complex>
size_t alloc_bytes = alloc_size * element_size;
std::complex<double> *a, *b, *aux;
DTFFT_CXX_CALL( plan.mem_alloc(alloc_bytes, (void**)&a) );
DTFFT_CXX_CALL( plan.mem_alloc(alloc_bytes, (void**)&b) );
DTFFT_CXX_CALL( plan.mem_alloc(alloc_bytes, (void**)&aux) );
Note
Memory allocated with mem_alloc() must be deallocated with mem_free() before the plan is destroyed to avoid memory leaks.
Pencil Decomposition¶
For detailed layout information in 3D plans (e.g., intermediate states like Y-direction distribution), use
the get_pencil() method. This returns a dtfft_pencil_t structure containing:
dim: Aligned dimension ID (1 for X, 2 for Y, 3 for Z).
ndims: Number of dimensions in the pencil (2 or 3)
starts: Local start indices in natural Fortran order. (Allocatable array of size
ndims)counts: Local element counts in natural Fortran order (Allocatable array of size
ndims)size: Total number of elements in a pencil
integer(int8) :: i
type(dtfft_pencil_t) :: pencils(3)
do i = 1, 3
! Get pencil for dimension i
pencils(i) = plan%get_pencil(i, error_code)
DTFFT_CHECK(error_code)
! Access pencil properties, e.g., pencils(i)%dim, pencils(i)%starts
end do
dtfft_pencil_t pencils[3];
for (int8_t i = 0; i < 3; i++) {
DTFFT_CALL( dtfft_get_pencil(plan, i + 1, &pencils[i]) );
// Access pencil properties, e.g., pencils[i].dim, pencils[i].starts
}
std::vector<dtfft::Pencil> pencils;
for (int8_t i = 0; i < 3; i++) {
dtfft::Pencil pencil = plan.get_pencil(i + 1); // This call will throw an exception if an error occurs
pencils.push_back(pencil);
// Access pencil properties, e.g., pencils[i].get_dim(), pencils[i].get_starts()
}
In C++, the dtfft::Pencil class provides properties via getter methods:
get_ndims(): Returns the number of dimensionsget_dim(): Returns the aligned dimension IDget_starts(): Returns the start indices as astd::vector<int32_t>get_counts(): Returns the element counts as astd::vector<int32_t>get_size(): Returns the total number of elements.c_struct(): Returns the underlying C structure (dtfft_pencil_t)
Plan properties¶
After creating a plan, several methods are available to inspect its runtime configuration and behavior
These methods, defined in dtfft_plan_t, provide valuable insights into the plan’s setup and are
particularly useful for debugging or integrating with custom workflows. The following methods are supported:
get_z_slab_enabled(): Returns a logical value indicating whether Z-slab optimization is active in the plan, as configured viadtfft_config_t(see Setting Additional Configurations). This helps users confirm if the optimization is applied, especially when troubleshooting performance or compatibility issues.get_backend(): Retrieves the GPU backend (e.g., NCCL, MPI P2P) selected during plan creation or autotuning withDTFFT_PATIENTeffort (see Selecting plan effort).Available only in CUDA-enabled builds, this method allows users to verify the transposition strategy chosen for GPU execution.
get_stream(): Returns the CUDA stream associated with the plan, either the default stream managed bydtFFTor a custom one set viadtfft_config_t(see Setting Additional Configurations).Available only in CUDA-enabled builds, it enables integration with existing CUDA workflows by exposing the stream used for GPU operations.
report(): Prints detailed plan information to stdout, including grid decomposition, backend selection, and optimization settings. This diagnostic tool aids in understanding the plan’s configuration and troubleshooting unexpected behavior.
These methods provide a window into the plan’s internal state, allowing users to validate settings or gather diagnostics post-creation. They remain accessible until the plan is destroyed with destroy().
Plan Execution¶
There are two primary methods to execute a plan in dtFFT: transpose and execute.
Below, we detail each method, including their behavior for host and GPU versions of the API.
Transpose¶
The first method is to call the transpose() method of the plan.
Signature¶
The signature is as follows:
subroutine dtfft_plan_t%transpose(in, out, transpose_type, error_code)
type(*) intent(inout) :: in(..)
type(*) intent(inout) :: out(..)
type(dtfft_transpose_t), intent(in) :: transpose_type
integer(int32), optional, intent(out) :: error_code
end subroutine
subroutine dtfft_plan_t%transpose_ptr(in, out, transpose_type, error_code)
type(c_ptr) intent(in) :: in
type(c_ptr) intent(in) :: out
type(dtfft_transpose_t), intent(in) :: transpose_type
integer(int32), optional, intent(out) :: error_code
end subroutine
dtfft_error_t
dtfft_transpose(
dtfft_plan_t plan,
void *in,
void *out,
const dtfft_transpose_t transpose_type);
dtfft::Error
dtfft::Plan::transpose(
void *in,
void *out,
const dtfft::Transpose transpose_type);
Description¶
This method transposes data according to the specified transpose_type. Supported options include:
DTFFT_TRANSPOSE_X_TO_Y: Transpose from X to YDTFFT_TRANSPOSE_Y_TO_X: Transpose from Y to XDTFFT_TRANSPOSE_Y_TO_Z: Transpose from Y to Z (valid only for 3D plans)DTFFT_TRANSPOSE_Z_TO_Y: Transpose from Z to Y (valid only for 3D plans)DTFFT_TRANSPOSE_X_TO_Z: Transpose from X to Z (valid only for 3D plans using Z-slab)DTFFT_TRANSPOSE_Z_TO_X: Transpose from Z to X (valid only for 3D plans using Z-slab)
Note
Passing the same pointer to both in and out is not permitted; doing so triggers the error DTFFT_ERROR_INPLACE_TRANSPOSE.
Note
Calling transpose() for R2C plan is not allowed.
Host Version: Executes a single MPI_Alltoall(w) call using non-contiguous MPI Datatypes and returns once the out
buffer contains the transposed data, leaving the in buffer unchanged.
GPU Version: Performs a two-step transposition:
Launches an nvRTC-compiled kernel to transpose data locally. On a single GPU, this completes the task, and control returns to the user.
Performs data redistribution using the selected GPU backend (e.g., MPI, NCCL), followed by final processing (e.g., unpacking via nvRTC or copying to
out) Differences between backends begin at this step (see below for specifics).
In the GPU version, the in buffer may serve as intermediate storage, potentially modifying its contents,
except when operating on a single GPU, where it remains unchanged.
GPU Backend-Specific Behavior¶
MPI-Based Backends (
DTFFT_BACKEND_MPI_P2PandDTFFT_BACKEND_MPI_A2A):After local transposition, redistributes data using CUDA-aware MPI. Data destined for the same GPU (“self” data) is copied via
cudaMemcpyAsync.For MPI Peer-to-Peer (
MPI_P2P), it issues non-blockingMPI_IrecvandMPI_Isendcalls (orMPI_Recv_initandMPI_Send_initwithMPI_Startallif built withDTFFT_ENABLE_PERSISTENT_COMM) for point-to-point exchanges between GPUs, completing withMPI_Waitall; an nvRTC kernel then unpacks all data at once.For MPI All-to-All (
MPI_A2A), it performs a singleMPI_Ialltoallvcall (orMPI_Alltoallv_initwithMPI_Startif built withDTFFT_ENABLE_PERSISTENT_COMMand supported by MPI), completing withMPI_Wait; an nvRTC kernel then unpacks the data.Pipelined MPI Peer-to-Peer (
DTFFT_BACKEND_MPI_P2P_PIPELINED):After local transposition, redistributes data similarly to
MPI_P2Pusing CUDA-aware MPI with non-blockingMPI_IrecvandMPI_Isendcalls (orMPI_Recv_initandMPI_Send_initwithMPI_Startallif built withDTFFT_ENABLE_PERSISTENT_COMM).Data destined for the same GPU (“self” data) is copied via
cudaMemcpyAsync. UnlikeMPI_P2P, as soon as data arrives from a process i, it is immediately unpacked by launching an nvRTC kernel specific to that process’s data.This results in N nvRTC kernels (one per process) instead of a single kernel unpacking all data, enabling pipelining of communication and computation to reduce latency.
NCCL-Based Backends (
DTFFT_BACKEND_NCCLandDTFFT_BACKEND_NCCL_PIPELINED):After local transposition, redistributes data using the NCCL library for GPU-to-GPU communication.
For NCCL (
DTFFT_BACKEND_NCCL), it executes a cycle ofncclSendandncclRecvcalls withinncclGroupStartandncclGroupEndto perform point-to-point exchanges between all processes, including “self” data. Once communication completes, an nvRTC kernel unpacks all data at once, similar toMPI_P2P.For Pipelined NCCL (
DTFFT_BACKEND_NCCL_PIPELINED), it copies “self” data usingcudaMemcpyAsyncand immediately unpacks it with an nvRTC kernel in a parallel stream created bydtFFT. Concurrently, in main stream, it runs the samencclSend/ncclRecvcycle (withinncclGroupStartandncclGroupEnd) for data exchange with other processes, excluding “self” data. After communication completes, an nvRTC kernel unpacks the data received from all other processes.cuFFTMp (
DTFFT_BACKEND_CUFFTMP):After local transposition from the
inbuffer to theoutbuffer using an nvRTC kernel, redistributes data using the cuFFTMp library by callingcufftMpExecReshapeAsync. This function performs an asynchronous all-to-all exchange across multiple GPUs, reshaping the data from theoutbuffer back into theinbuffer. Since the final transposed data is required in theoutbuffer, it is then copied fromintooutusingcudaMemcpyAsync.Pipelined cuFFTMp (
DTFFT_BACKEND_CUFFTMP_PIPELINED):This backend optimizes the standard
cuFFTMpapproach by eliminating the finalcudaMemcpyAsyncstep. It begins with a local transposition from theinbuffer to an auxiliary (aux) buffer using an nvRTC kernel. Then, it callscufftMpExecReshapeAsyncto perform the all-to-all exchange, reshaping the data directly from theauxbuffer into the finaloutbuffer. This approach avoids the extra copy required by the standardcuFFTMpbackend, potentially reducing latency, but requires an additionalauxbuffer for its operation.
Note
Performance and behavior may vary based on GPU interconnects (e.g., NVLink), MPI implementation, and system configuration.
To automatically select the fastest GPU backend for a given system, use the DTFFT_PATIENT effort level when creating plan,
which tests each backend and chooses the most efficient one.
Note
Pipelined backends (DTFFT_BACKEND_MPI_P2P_PIPELINED and DTFFT_BACKEND_NCCL_PIPELINED) require an
additional aux buffer, which is managed internally by dtFFT and inaccessible to the user.
Similarly, DTFFT_BACKEND_CUFFTMP may require an aux buffer if cufftMpGetReshapeSize returns a value greater than 0,
such as when the environment variable CUFFT_RESHAPE_USE_PACKING=1 is set.
In all other cases, transposition requires only the in and out buffers.
Example¶
Below is an example of transposing data from X to Y and back:
! Assuming plan is created and buffers `a` and `b` are allocated.
call plan%transpose(a, b, DTFFT_TRANSPOSE_X_TO_Y, error_code)
DTFFT_CHECK(error_code) ! Checks for errors, e.g., DTFFT_ERROR_INPLACE_TRANSPOSE
! Process Y-aligned data in buffer `b`
! ... (e.g., apply scaling or analysis)
! Reverse transposition
call plan%transpose(b, a, DTFFT_TRANSPOSE_Y_TO_X, error_code)
DTFFT_CHECK(error_code)
! Alternatively, using pointers of type c_ptr
call plan%transpose_ptr(a_ptr, b_ptr, DTFFT_TRANSPOSE_X_TO_Y, error_code)
DTFFT_CHECK(error_code)
! ...
call plan%transpose_ptr(b_ptr, a_ptr, DTFFT_TRANSPOSE_Y_TO_X, error_code)
DTFFT_CHECK(error_code)
// Assuming plan is created and buffers `a` and `b` are allocated.
DTFFT_CALL( dtfft_transpose(plan, a, b, DTFFT_TRANSPOSE_X_TO_Y) )
// Process Y-aligned data in buffer `b`
// ... (e.g., apply scaling or analysis)
// Reverse transposition
DTFFT_CALL( dtfft_transpose(plan, b, a, DTFFT_TRANSPOSE_Y_TO_X) )
// Assuming plan is created and buffers `a` and `b` are allocated.
DTFFT_CXX_CALL( plan.transpose(a, b, dtfft::Transpose::X_TO_Y) )
// Process Y-aligned data in buffer `b`
// ... (e.g., apply scaling or analysis)
// Reverse transposition
DTFFT_CXX_CALL( plan.transpose(b, a, dtfft::Transpose::Y_TO_X) )
Execute¶
The second method is to call the execute() method of the plan.
Signature¶
The signature is as follows:
subroutine dtfft_plan_t%execute(in, out, execute_type, aux, error_code)
type(*) intent(inout) :: in(..)
type(*) intent(inout) :: out(..)
type(dtfft_execute_t), intent(in) :: execute_type
type(*), optional, intent(inout) :: aux(..)
integer(int32), optional, intent(out) :: error_code
end subroutine
subroutine dtfft_plan_t%execute_ptr(in, out, execute_type, aux, error_code)
type(c_ptr) intent(in) :: in
type(c_ptr) intent(in) :: out
type(dtfft_execute_t), intent(in) :: execute_type
type(c_ptr) intent(in) :: aux
integer(int32), optional, intent(out) :: error_code
end subroutine
dtfft_error_t
dtfft_execute(
dtfft_plan_t plan,
void *in,
void *out,
const dtfft_execute_t execute_type,
void *aux);
dtfft::Error
dtfft::Plan::execute(
void *in,
void *out,
const dtfft::Execute execute_type,
void *aux=nullptr);
Description¶
This method executes a plan, performing transpositions and optionally FFTs based on the specified execute_type.
It supports in-place execution; the same pointer can be safely passed to both in and out.
To optimize memory usage, dtFFT uses the in buffer as intermediate storage, overwriting its contents.
Users needing to preserve original data should copy it elsewhere.
The key parameter is execute_type, with two options:
DTFFT_EXECUTE_FORWARD: Forward executionDTFFT_EXECUTE_BACKWARD: Backward execution
For 3D plans, the method operates as follows:
Forward Execution (DTFFT_EXECUTE_FORWARD):
If
Transpose-Only:Transpose from X to Y
Transpose from Y to Z
If
Transpose-Onlywith Z-slab and distinctinandout:Transpose from X to Z
If using FFT:
Forward FFT in X direction
Transpose from X to Y
Forward FFT in Y direction
Transpose from Y to Z
Forward FFT in Z direction
If using FFT with Z-slab:
Forward 2D FFT in X-Y directions
Transpose from X to Z
Forward FFT in Z direction
Backward Execution (DTFFT_EXECUTE_BACKWARD):
If
Transpose-Only:Transpose from Z to Y
Transpose from Y to X
If
Transpose-Onlywith Z-slab and distinctinandout:Transpose from Z to X
If using FFT:
Backward FFT in Z direction
Transpose from Z to Y
Backward FFT in Y direction
Transpose from Y to X
Backward FFT in X direction
If using FFT with Z-slab:
Backward FFT in Z direction
Transpose from Z to X
Backward 2D FFT in X-Y directions
Note
For Transpose-Only plans with a Z-slab and identical in and out pointers, execution uses a
two-step transposition, as direct transposition is not possible with a single pointer.
Note
The only case when in-place execution is not allowed is 2D Transpose-Only plan. Doing so will trigger the error DTFFT_ERROR_INPLACE_TRANSPOSE.
An optional auxiliary buffer aux may be provided. If omitted on the first call to execute(),
it is allocated internally and freed when the plan is destroyed. C users can pass NULL to opt out.
Example¶
Below is an example of executing a plan forward and backward:
! Assuming a 3D FFT plan is created and buffers `a`, `b`, and `aux` are allocated
call plan%execute(a, b, DTFFT_EXECUTE_FORWARD, aux, error_code)
DTFFT_CHECK(error_code) ! Checks for execution errors
! Process Fourier-space data in buffer `b`
! ... (e.g., apply filtering)
! Backward execution
call plan%execute(b, a, DTFFT_EXECUTE_BACKWARD, aux, error_code)
DTFFT_CHECK(error_code)
! Alternatively, using pointers of type c_ptr. If aux is not needed, pass c_null_ptr
call plan%execute_ptr(a_ptr, b_ptr, DTFFT_EXECUTE_FORWARD, aux_ptr, error_code)
DTFFT_CHECK(error_code)
! ...
call plan%execute_ptr(b_ptr, a_ptr, DTFFT_EXECUTE_BACKWARD, c_null_ptr, error_code)
DTFFT_CHECK(error_code)
// Assuming a 3D FFT plan is created and buffers `a`, `b`, and `aux` are allocated
DTFFT_CALL( dtfft_execute(plan, a, b, DTFFT_EXECUTE_FORWARD, aux) )
// Process Fourier-space data in buffer `b`
// ... (e.g., apply filtering)
// Backward execution
DTFFT_CALL( dtfft_execute(plan, b, a, DTFFT_EXECUTE_BACKWARD, aux) )
// Assuming a 3D FFT plan is created and buffers `a`, `b`, and `aux` are allocated
DTFFT_CXX_CALL( plan.execute(a, b, dtfft::Execute::FORWARD, aux) )
// Process Fourier-space data in buffer `b`
// ... (e.g., apply filtering)
// Backward execution
DTFFT_CXX_CALL( plan.execute(b, a, dtfft::Execute::BACKWARD, aux) )
GPU Notes¶
Both transpose and execute in the GPU version operate asynchronously.
When either function returns, computations are queued in a CUDA stream but may not be complete.
Full synchronization with the host requires calling cudaDeviceSynchronize, cudaStreamSynchronize, or !$acc wait (for OpenACC).
During execution, dtFFT may use multiple CUDA streams, but the final computation stage always occurs in the
stream returned by get_stream(). Thus, synchronization may be unnecessary if users submit additional kernels to that stream.
Plan Finalization¶
To fully release all memory resources allocated by dtFFT for a plan,
the plan must be explicitly destroyed. This ensures that all internal buffers and resources associated with the plan are freed.
Note
If buffers were allocated using mem_alloc(), they must be deallocated with mem_free() before calling the destroy method.
Failing to do so may result in memory leaks or undefined behavior.
Example¶
Below is an example of properly finalizing a plan and freeing allocated memory:
! Assuming a plan and buffers ``a``, ``b`` and ``aux`` are created and allocated with ``mem_alloc``
call plan%mem_free(a, error_code); DTFFT_CHECK(error_code)
call plan%mem_free(b, error_code); DTFFT_CHECK(error_code)
call plan%mem_free(aux, error_code); DTFFT_CHECK(error_code)
! Pointers allocated via mem_alloc_ptr must be freed with ``mem_free_ptr``
call plan%mem_free_ptr(a_ptr, error_code); DTFFT_CHECK(error_code)
call plan%mem_free_ptr(b_ptr, error_code); DTFFT_CHECK(error_code)
call plan%mem_free_ptr(aux_ptr, error_code); DTFFT_CHECK(error_code)
call plan%destroy(error_code) ! Destroy the plan
DTFFT_CHECK(error_code)
// Assuming a plan and buffers ``a``, ``b`` and ``aux`` are created and allocated with `dtfft_mem_alloc`
DTFFT_CALL( dtfft_mem_free(plan, a) ) // Free buffer ``a``
DTFFT_CALL( dtfft_mem_free(plan, b) ) // Free buffer ``b``
DTFFT_CALL( dtfft_mem_free(plan, aux) ) // Free buffer ``aux``
DTFFT_CALL( dtfft_destroy(&plan) ) // Destroy the plan
// Assuming a plan and buffers ``a``, ``b`` and ``aux`` are created and allocated with `mem_alloc`
DTFFT_CXX_CALL( plan.mem_free(a) ) // Free buffer ``a``
DTFFT_CXX_CALL( plan.mem_free(b) ) // Free buffer ``b``
DTFFT_CXX_CALL( plan.mem_free(aux) ) // Free buffer ``aux``
DTFFT_CXX_CALL( plan.destroy() ) // Explicitly destroy the plan (optional if using destructor)
// Automatic ~Plan() call when `plan` goes out of scope
Complete Example¶
The following example demonstrates the full lifecycle of a dtFFT complex-to-complex plan:
creating a plan, allocating memory, executing forward and backward transformations, and properly finalizing resources.
program dtfft_sample
#include "dtfft.f03"
use iso_fortran_env
use dtfft
use mpi ! or use mpi_f08
use iso_c_binding
implicit none
type(dtfft_plan_c2c_t) :: plan
type(dtfft_config_t) :: config
integer(int32) :: dims(3) = [64, 64, 64] ! Example dimensions
integer(int32) :: error_code
integer(int64) :: alloc_size, element_size, alloc_bytes
complex(real64), pointer :: a(:), b(:), aux(:)
call MPI_Init(error_code)
! Create dtfft_config_t object with default values
config = dtfft_config_t()
! Disable Z-slab
config%enable_z_slab = .false.
! Apply configuration to dtFFT
call dtfft_set_config(config, error_code)
DTFFT_CHECK(error_code)
! Create plan
call plan%create(dims, MPI_COMM_WORLD, DTFFT_DOUBLE, DTFFT_PATIENT, DTFFT_EXECUTOR_NONE, error_code)
DTFFT_CHECK(error_code)
! Obtain allocation sizes
alloc_size = plan%get_alloc_size(error_code); DTFFT_CHECK(error_code)
! Allocate memory
call plan%mem_alloc(alloc_size, a, error_code); DTFFT_CHECK(error_code)
call plan%mem_alloc(alloc_size, b, error_code); DTFFT_CHECK(error_code)
call plan%mem_alloc(alloc_size, aux, error_code); DTFFT_CHECK(error_code)
! Forward execution
call plan%execute(a, b, DTFFT_EXECUTE_FORWARD, aux, error_code)
DTFFT_CHECK(error_code)
! Process Fourier-space data in buffer `b` (e.g., apply filtering)
! ...
! Backward execution
call plan%execute(b, a, DTFFT_EXECUTE_BACKWARD, aux, error_code)
DTFFT_CHECK(error_code)
! Free memory
call plan%mem_free(a, error_code); DTFFT_CHECK(error_code)
call plan%mem_free(b, error_code); DTFFT_CHECK(error_code)
call plan%mem_free(aux, error_code); DTFFT_CHECK(error_code)
! Destroy the plan
call plan%destroy(error_code)
DTFFT_CHECK(error_code)
call MPI_Finalize(error_code)
end program dtfft_sample
#include <dtfft.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
dtfft_plan_t plan;
dtfft_complex *a, *b, *aux; // Use dtfft_complex from dtfft.h
int32_t dims[3] = {64, 64, 64}; // Example dimensions
size_t alloc_size;
MPI_Init(&argc, &argv);
dtfft_config_t config;
// Set default values to config
dtfft_create_config(&config);
// Disable Z-slab
config.enable_z_slab = 0;
// Apply configuration to dtFFT
DTFFT_CALL( dtfft_set_config(config) );
// Create plan
DTFFT_CALL( dtfft_create_plan_c2c(3, dims, MPI_COMM_WORLD, DTFFT_DOUBLE, DTFFT_PATIENT, DTFFT_EXECUTOR_NONE, &plan) );
// Obtain allocation size
DTFFT_CALL( dtfft_get_alloc_size(plan, &alloc_size) );
// Allocate memory
DTFFT_CALL( dtfft_mem_alloc(plan, sizeof(dtfft_complex) * alloc_size, (void**)&a) );
DTFFT_CALL( dtfft_mem_alloc(plan, sizeof(dtfft_complex) * alloc_size, (void**)&b) );
DTFFT_CALL( dtfft_mem_alloc(plan, sizeof(dtfft_complex) * alloc_size, (void**)&aux) );
// Forward execution
DTFFT_CALL( dtfft_execute(plan, a, b, DTFFT_EXECUTE_FORWARD, aux) );
// Process Fourier-space data in buffer `b` (e.g., apply filtering)
// ...
// Backward execution
DTFFT_CALL( dtfft_execute(plan, b, a, DTFFT_EXECUTE_BACKWARD, aux) );
// Free memory
DTFFT_CALL( dtfft_mem_free(plan, a) );
DTFFT_CALL( dtfft_mem_free(plan, b) );
DTFFT_CALL( dtfft_mem_free(plan, aux) );
// Destroy the plan
DTFFT_CALL( dtfft_destroy(&plan) );
MPI_Finalize();
return 0;
}
#include <dtfft.hpp>
#include <mpi.h>
#include <complex>
#include <vector>
using namespace dtfft;
int main(int argc, char *argv[])
{
MPI_Init(&argc, &argv);
std::vector<int32_t> dims = {64, 64, 64}; // Example dimensions
// Set default values to config
Config config;
config.set_enable_z_slab(false);
// Apply configuration to dtFFT
DTFFT_CXX_CALL( set_config(config) );
// Create plan
PlanC2C plan(dims, MPI_COMM_WORLD, Precision::DOUBLE, Effort::PATIENT, Executor::NONE);
size_t alloc_size, element_size;
DTFFT_CXX_CALL( plan.get_alloc_size(&alloc_size) );
DTFFT_CXX_CALL( plan.get_element_size(&element_size) );
size_t alloc_bytes = alloc_size * element_size;
std::complex<double> *a, *b, *aux;
// Allocate memory
DTFFT_CXX_CALL( plan.mem_alloc(alloc_bytes, (void**)&a) );
DTFFT_CXX_CALL( plan.mem_alloc(alloc_bytes, (void**)&b) );
DTFFT_CXX_CALL( plan.mem_alloc(alloc_bytes, (void**)&aux) );
// Forward execution
DTFFT_CXX_CALL( plan.execute(a, b, Execute::FORWARD, aux) );
// Process Fourier-space data in buffer `b` (e.g., apply filtering)
// ...
// Backward execution
DTFFT_CXX_CALL( plan.execute(b, a, Execute::BACKWARD, aux) );
// Free memory
DTFFT_CXX_CALL( plan.mem_free(a) );
DTFFT_CXX_CALL( plan.mem_free(b) );
DTFFT_CXX_CALL( plan.mem_free(aux) );
// Explicitly destroy the plan
DTFFT_CXX_CALL( plan.destroy() );
MPI_Finalize();
return 0;
}