GPU Computing
Tarang.jl provides comprehensive GPU acceleration through CUDA.jl, enabling significant speedups for spectral simulations on NVIDIA GPUs.
Overview
Requirements
GPU support requires an NVIDIA GPU with CUDA capability 5.0+ and the CUDA.jl package.
Key Features
| Feature | Description |
|---|---|
| Automatic Dispatch | Arrays automatically use GPU kernels when on GPU memory |
| CUFFT Integration | Optimized FFT plans via NVIDIA's cuFFT library |
| Custom Kernels | KernelAbstractions.jl for portable CPU/GPU code |
| Memory Pools | Efficient GPU memory management with pooling |
| Multi-GPU | MPI + CUDA for distributed GPU computing |
| Mixed Precision | Float32 support for memory-bound problems |
Quick Start
Basic GPU Setup
using Tarang
using CUDA
# Check GPU availability
@assert CUDA.functional() "CUDA not available"
# Create distributor with GPU architecture
coords = CartesianCoordinates("x", "y")
dist = Distributor(coords; mesh=(1,), dtype=Float64, device=GPU())
# Create bases and fields (automatically on GPU)
xbasis = Fourier(coords, "x", 256)
ybasis = Fourier(coords, "y", 256)
field = ScalarField(dist, "u", (xbasis, ybasis))
# Field data is a CuArray
@assert field["g"] isa CuArrayCPU vs GPU Architecture
# CPU execution (default)
dist_cpu = Distributor(coords; device=CPU())
# GPU execution
dist_gpu = Distributor(coords; device=GPU())
# Check architecture
arch = dist_gpu.architecture # GPU()GPU Transforms
Automatic FFT Acceleration
When fields are on GPU, transforms automatically use CUFFT:
using Tarang, CUDA
dist = Distributor(coords; device=GPU())
field = ScalarField(dist, "u", (xbasis, ybasis))
# Initialize with GPU data
field["g"] .= CUDA.rand(Float64, size(field["g"])...)
# Forward transform (uses CUFFT automatically)
forward_transform!(field)
# Backward transform
backward_transform!(field)FFT Mode Control
Control when GPU FFTs are used:
# Per-field control
set_gpu_fft_mode!(field, :gpu) # Always use GPU FFT
set_gpu_fft_mode!(field, :cpu) # Always use CPU FFT
set_gpu_fft_mode!(field, :auto) # Heuristic-based (default)
# Global threshold for :auto mode
# Use GPU FFT only if array has >= N elements
set_gpu_fft_min_elements!(64_000)Mixed Fourier-Chebyshev Transforms
GPU DCT for Chebyshev bases:
# Mixed basis domain
xbasis = Fourier(coords, "x", 256) # FFT
zbasis = ChebyshevT(coords, "z", 64) # DCT
dist = Distributor(coords; device=GPU())
field = ScalarField(dist, "T", (xbasis, zbasis))
# Transforms automatically select FFT or DCT per dimension
forward_transform!(field) # FFT in x, DCT in z
backward_transform!(field) # IFFT in x, IDCT in zGPU Memory Management
Tarang relies on CUDA.jl's built-in memory pool (following the Oceananigans.jl approach). No custom pooling is needed — CUDA.jl handles allocation efficiently.
Configure the pool via environment variable:
export JULIA_CUDA_MEMORY_POOL=binned # default, efficient for repeated allocationsData Transfers
# CPU → GPU
async_copy_to_gpu!(gpu_array, cpu_array)
# GPU → CPU
async_copy_to_cpu!(cpu_array, gpu_array)For pinned memory (faster MPI transfers), use CUDA.jl directly:
CUDA.Mem.pin(cpu_array) # page-lock for faster DMA transfersCustom GPU Kernels
KernelAbstractions Integration
Write portable kernels that run on both CPU and GPU:
using Tarang, KernelAbstractions
# Define a kernel
@kernel function add_kernel!(c, @Const(a), @Const(b))
i = @index(Global)
@inbounds c[i] = a[i] + b[i]
end
# Wrap as KernelOperation
add_op = KernelOperation(add_kernel!) do c, a, b
length(c) # ndrange
end
# Use on any architecture
arch = GPU()
a = ones(arch, Float64, 1024)
b = ones(arch, Float64, 1024)
c = zeros(arch, Float64, 1024)
add_op(arch, c, a, b) # Runs on GPUBuilt-in GPU Kernels
Tarang provides optimized kernels for common operations:
using TarangCUDAExt
# Element-wise operations
gpu_add!(c, a, b) # c = a + b
gpu_sub!(c, a, b) # c = a - b
gpu_mul!(c, a, b) # c = a * b
gpu_scale!(y, α, x) # y = α * x
gpu_axpy!(y, α, x) # y = y + α * x
gpu_linear_combination!(y, α, a, β, b) # y = α*a + β*b
# Fused operations for timestepping
gpu_rk_stage!(u_new, u, k, dt, coeff)
gpu_axpby!(y, α, x, β) # y = α*x + β*y
# Physics kernels
gpu_kinetic_energy_2d!(ke, ux, uy)
gpu_kinetic_energy_3d!(ke, ux, uy, uz)
gpu_viscous_damping!(field, ν, k2)Distributed GPU Computing
MPI + CUDA
For multi-GPU simulations:
using Tarang, MPI, CUDA
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
# Assign GPU to MPI rank
if CUDA.ndevices() >= MPI.Comm_size(comm)
CUDA.device!(rank)
end
# Create distributed GPU setup
coords = CartesianCoordinates("x", "y", "z")
dist = Distributor(coords; mesh=(2, 2), device=GPU())
# Each rank has its own GPU memory
field = ScalarField(dist, "u", bases)CUDA-Aware MPI
For direct GPU-to-GPU communication:
using TarangCUDAExt: check_cuda_aware_mpi
if check_cuda_aware_mpi()
println("CUDA-aware MPI available - using direct GPU transfers")
else
println("Staging through CPU for MPI transfers")
endTransposableField for GPU+MPI
Efficient distributed FFTs with GPU:
# Create transposable field for distributed transforms
field = ScalarField(dist, "u", bases)
tf = TransposableField(field)
# Distributed forward transform (handles GPU transposes)
distributed_forward_transform!(tf)
# Distributed backward transform
distributed_backward_transform!(tf)Performance Optimization
Best Practices
Use Float32 when possible - 2x memory bandwidth, often sufficient accuracy
dist = Distributor(coords; dtype=Float32, device=GPU())Batch operations - Minimize kernel launches
# Bad: many small operations for i in 1:n gpu_scale!(field, α) end # Good: fused operations gpu_rk_stage!(u_new, u, k, dt, coeff)Preallocate buffers - Avoid allocation in hot loops
# Preallocate work arrays work = zeros(GPU(), Float64, size(field["g"])) for step in 1:nsteps # Reuse work array compute!(work, field) endSynchronize when needed - Ensure GPU operations complete before CPU access
CUDA.synchronize() # Wait for all GPU operations to finish
Profiling
using CUDA
# Profile a section
CUDA.@profile begin
forward_transform!(field)
backward_transform!(field)
end
# Time with synchronization
CUDA.@elapsed begin
forward_transform!(field)
CUDA.synchronize()
endMemory Bandwidth Optimization
# Check if transform is memory-bound
n = prod(size(field["g"]))
bytes_transferred = n * sizeof(eltype(field["g"])) * 4 # rough estimate
# Theoretical bandwidth (e.g., A100 = 2 TB/s)
theoretical_time = bytes_transferred / 2e12
# Compare to actual time
actual_time = CUDA.@elapsed forward_transform!(field)
efficiency = theoretical_time / actual_time
println("Memory bandwidth efficiency: $(efficiency * 100)%")Tensor Core Support
For supported operations on Ampere+ GPUs:
using TarangCUDAExt: enable_tensor_cores!, disable_tensor_cores!
# Enable tensor cores (requires compatible data types)
enable_tensor_cores!()
# Disable if numerical precision is critical
disable_tensor_cores!()Troubleshooting
Common Issues
Out of Memory
# Check available memory before large allocations
info = gpu_memory_info()
if info.free_bytes < required_bytes
clear_memory_pool!() # Free cached allocations
GC.gc() # Trigger garbage collection
CUDA.reclaim() # Reclaim CUDA memory
endSlow Performance
# Ensure synchronization isn't killing performance
CUDA.allowscalar(false) # Disable slow scalar indexing
# Check for CPU fallbacks
@assert field["g"] isa CuArray "Data not on GPU!"MPI + CUDA Issues
# Ensure correct GPU assignment
println("Rank $rank using GPU $(CUDA.device())")
# Force synchronization before MPI calls
CUDA.synchronize()
MPI.Barrier(comm)API Reference
See the GPU API documentation for complete function references.
Key Functions
| Function | Description |
|---|---|
GPU() | GPU architecture singleton |
on_architecture(GPU(), array) | Move array to GPU |
forward_transform!(field) | GPU-accelerated forward FFT |
backward_transform!(field) | GPU-accelerated inverse FFT |
set_gpu_fft_mode!(field, mode) | Control FFT backend |
TransposableField(field) | Distributed GPU transforms |