Distributed Computing
MPI Parallelization
Scale your transforms across thousands of cores with PencilArrays
SHTnsKit.jl integrates seamlessly with MPI through PencilArrays.jl for distributed memory parallelization. This enables scaling spherical harmonic transforms to large HPC clusters.
MPI parallelization is most beneficial for large problems (lmax > 128) or when processing many fields simultaneously. For single-field transforms on smaller problems, consider GPU acceleration instead.
Quick Start
using MPI
MPI.Init()
using SHTnsKit, PencilArrays, PencilFFTs
# Configuration
lmax = 64
nlat = lmax + 2
nlon = 2*lmax + 1
cfg = create_gauss_config(lmax, nlat; nlon=nlon)
# Create distributed array
comm = MPI.COMM_WORLD
pen = Pencil((nlat, nlon), comm)
fθφ = PencilArray(pen, zeros(Float64, PencilArrays.size_local(pen)...))
# Fill with test data (Y_2^0 pattern)
ranges = PencilArrays.range_local(pen)
for (i_local, i_global) in enumerate(ranges[1])
    x = cfg.x[i_global]
    for j in 1:length(ranges[2])
        fθφ[i_local, j] = (3*x^2 - 1)/2
    end
end
# Distributed transforms
Alm = SHTnsKit.dist_analysis(cfg, fθφ)
fθφ_recovered = SHTnsKit.dist_synthesis(cfg, Alm; prototype_θφ=fθφ, real_output=true)
# Verify
max_err = maximum(abs.(parent(fθφ_recovered) .- parent(fθφ)))
global_max_err = MPI.Allreduce(max_err, MPI.MAX, comm)
if MPI.Comm_rank(comm) == 0
    println("Roundtrip error: $global_max_err")
end
destroy_config(cfg)
MPI.Finalize()
Installation
Requirements
- Julia 1.10+
- MPI library (OpenMPI or MPICH)
- PencilArrays.jl and PencilFFTs.jl
Setup
Linux (Ubuntu/Debian):
sudo apt-get install libopenmpi-dev openmpi-bin
macOS:
brew install open-mpi
Julia packages:
using Pkg
Pkg.add(["MPI", "PencilArrays", "PencilFFTs"])
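By default, MPI.jl ships with a bundled MPI implementation (MPICH_jll on most platforms). On a cluster you usually want MPI.jl to use the system library installed above instead; a minimal sketch using MPIPreferences (run once per project, then restart Julia):
using Pkg
Pkg.add("MPIPreferences")
using MPIPreferences
MPIPreferences.use_system_binary()  # writes LocalPreferences.toml pointing MPI.jl at the system MPI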
API Reference
Distributed Transforms
| Function | Description |
|---|---|
| dist_analysis(cfg, fθφ) | Distributed spatial → spectral transform |
| dist_synthesis(cfg, Alm; prototype_θφ) | Distributed spectral → spatial transform |
| dist_analysis_sphtor(cfg, Vθ, Vφ) | Distributed vector field analysis |
| dist_synthesis_sphtor(cfg, Slm, Tlm; prototype_θφ) | Distributed vector field synthesis |
Spectral Operators
| Function | Description |
|---|---|
| dist_scalar_laplacian(cfg, fθφ) | Distributed Laplacian computation |
| dist_spatial_divergence(cfg, Vθ, Vφ) | Distributed divergence |
| dist_spatial_vorticity(cfg, Vθ, Vφ) | Distributed vorticity |
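These operators act on distributed spatial fields. As a quick sanity check, the Y_2^0 field from the Quick Start is an eigenfunction of the Laplacian with eigenvalue -l(l+1) = -6 on the unit sphere, so a sketch along these lines (reusing cfg, fθφ, and comm from the Quick Start, and assuming dist_scalar_laplacian returns a PencilArray with the same decomposition) should agree with -6 .* fθφ up to roundoff:
# ∇²Y_2^0 = -2(2+1) Y_2^0 = -6 Y_2^0 on the unit sphere
lap = SHTnsKit.dist_scalar_laplacian(cfg, fθφ)
local_err = maximum(abs.(parent(lap) .+ 6 .* parent(fθφ)))
global_err = MPI.Allreduce(local_err, MPI.MAX, comm)
MPI.Comm_rank(comm) == 0 && println("Laplacian eigenvalue error: $global_err")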
Running MPI Programs
Save your script as transform_mpi.jl and run with:
mpiexec -n 4 julia --project transform_mpi.jl
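For quick local tests you can also launch from within Julia; a sketch assuming MPI.jl's mpiexec helper, which passes the configured MPI launcher command to a function:
using MPI
MPI.mpiexec() do exe
    run(`$exe -n 4 $(Base.julia_cmd()) --project transform_mpi.jl`)
end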
Working with PencilArrays
Understanding Domain Decomposition
PencilArrays distributes data across MPI ranks by splitting along one or more dimensions:
using MPI, PencilArrays
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
# Create a pencil decomposition over the (nlat, nlon) grid
lmax = 64
nlat, nlon = lmax + 2, 2*lmax + 1
pen = Pencil((nlat, nlon), comm)
# Each rank owns a portion of the data
local_size = PencilArrays.size_local(pen)
global_range = PencilArrays.range_local(pen)
println("Rank $rank: local size = $local_size, global range = $global_range")Accessing Local Data
# Create distributed array
fθφ = PencilArray(pen, zeros(Float64, PencilArrays.size_local(pen)...))
# Access local data directly
local_data = parent(fθφ)
# Use the local_range() helper from SHTnsKit to iterate over locally owned indices
for idx in local_range(fθφ)
    local_data[idx] = 0.0   # fill each locally owned point with your data
end
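If you prefer global indices, PencilArrays also provides a global_view wrapper around a PencilArray; a sketch assuming the pen, cfg, and fθφ from the examples above:
gθφ = global_view(fθφ)                    # indexed with global (i, j) indices
iθ_range, jφ_range = PencilArrays.range_local(pen)
for i in iθ_range, j in jφ_range
    gθφ[i, j] = (3*cfg.x[i]^2 - 1)/2      # same Y_2^0 test pattern as the Quick Start
end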
Vector Field Transforms
using MPI
MPI.Init()
using SHTnsKit, PencilArrays, PencilFFTs
comm = MPI.COMM_WORLD
lmax = 64
nlat, nlon = lmax + 2, 2*lmax + 1
cfg = create_gauss_config(lmax, nlat; nlon=nlon)
# Create distributed vector field components
pen = Pencil((nlat, nlon), comm)
Vθ = PencilArray(pen, zeros(Float64, PencilArrays.size_local(pen)...))
Vφ = PencilArray(pen, zeros(Float64, PencilArrays.size_local(pen)...))
# Fill with test data
ranges = PencilArrays.range_local(pen)
for (i_local, i_global) in enumerate(ranges[1])
    θ = cfg.θ[i_global]
    for (j_local, j_global) in enumerate(ranges[2])
        φ = cfg.φ[j_global]
        Vθ[i_local, j_local] = cos(θ) * sin(φ)
        Vφ[i_local, j_local] = cos(φ)
    end
end
# Distributed vector transform
Slm, Tlm = SHTnsKit.dist_analysis_sphtor(cfg, Vθ, Vφ)
Vθ_out, Vφ_out = SHTnsKit.dist_synthesis_sphtor(cfg, Slm, Tlm; prototype_θφ=Vθ)
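# Optional sanity check (a sketch): global roundtrip error across ranks, assuming
# Vθ_out/Vφ_out use the same pencil decomposition as Vθ/Vφ
local_err = max(maximum(abs.(parent(Vθ_out) .- parent(Vθ))),
                maximum(abs.(parent(Vφ_out) .- parent(Vφ))))
global_err = MPI.Allreduce(local_err, MPI.MAX, comm)
MPI.Comm_rank(comm) == 0 && println("Vector roundtrip error: $global_err")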
destroy_config(cfg)
MPI.Finalize()
Best Practices
- Problem size: MPI overhead is significant for small problems; expect a benefit mainly for lmax > 64.
- Process count: Scale the number of processes with the problem size; too many processes hurt parallel efficiency.
- Memory: Each rank needs memory for its local data plus communication buffers; see the sketch after this list for a rough estimate.
- I/O: Use parallel I/O (HDF5, NetCDF) for large datasets.
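As a rough back-of-the-envelope estimate of the local-data term (a sketch; actual usage also includes spectral coefficients, transposition buffers, and FFT workspaces):
# Per-rank memory for one distributed Float64 scalar field on an (nlat, nlon) grid
lmax = 2048
nlat, nlon = lmax + 2, 2*lmax + 1
nranks = 64
field_mib = nlat * nlon * 8 / 1024^2      # whole field in MiB (8 bytes per Float64)
per_rank_mib = field_mib / nranks         # ideal even split across ranks
println("field ≈ $(round(field_mib; digits=1)) MiB, ≈ $(round(per_rank_mib; digits=2)) MiB per rank")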
See Also
- GPU Acceleration - Single-node GPU speedup
- Performance Guide - Optimization strategies
- API Reference - Complete function documentation