Running with MPI
Tarang.jl is designed for efficient parallel computing using MPI (Message Passing Interface). This guide covers everything you need to know about running Tarang.jl simulations in parallel.
Basic MPI Execution
Running a Script
Execute a Tarang.jl script with MPI using mpiexec:
mpiexec -n 4 julia your_script.jlThe -n 4 flag specifies 4 MPI processes. This must match your process mesh configuration in the script.
Process Mesh Configuration
The process mesh determines how MPI processes are arranged:
# 2D process mesh (4 processes: 2×2)
dist = Distributor(coords, mesh=(2, 2))
# 1D process mesh (4 processes: 4×1)
dist = Distributor(coords, mesh=(4, 1))
# 3D process mesh (8 processes: 2×2×2)
dist = Distributor(coords, mesh=(2, 2, 2))Ensure product(mesh) == number of MPI processes:
mesh=(2, 2) # requires: mpiexec -n 4
mesh=(4, 2) # requires: mpiexec -n 8
mesh=(4, 4) # requires: mpiexec -n 16Process Mesh Strategies
2D Problems
For 2D problems, you can parallelize in both directions:
# Balanced 2D decomposition (recommended)
mesh=(4, 4) # 16 processes, good for most 2D problems
# Horizontal decomposition (for thin domains)
mesh=(8, 2) # 16 processes, more processes in x-direction
# Vertical decomposition (for tall domains)
mesh=(2, 8) # 16 processes, more processes in z-directionRule of thumb: Match the mesh aspect ratio to your domain aspect ratio.
3D Problems
For 3D problems, Tarang.jl uses pencil decomposition:
# Cubic mesh (for isotropic domains)
mesh=(4, 4, 4) # 64 processes
# Anisotropic mesh (for stratified flows)
mesh=(8, 8, 2) # 128 processes, fewer in vertical directionPencil decomposition means data is distributed in two dimensions while remaining contiguous in the third. This enables efficient parallel FFTs.
Environment Variables
Thread Control
Set OpenMP thread count to avoid oversubscription:
export OMP_NUM_THREADS=1
mpiexec -n 8 julia script.jlNot setting OMP_NUM_THREADS=1 can cause significant performance degradation. Tarang.jl will warn you if this is not set correctly.
Julia Threads
Julia's multithreading can work alongside MPI:
export JULIA_NUM_THREADS=4
export OMP_NUM_THREADS=1
mpiexec -n 4 julia script.jlThis gives you 4 MPI processes × 4 Julia threads = 16 parallel tasks.
Other Useful Variables
# FFTW optimization
export FFTW_PLANNING_RIGOR=FFTW_MEASURE
# Tarang logging
export TARANG_LOG_LEVEL=DEBUG
# MPI debugging
export OMPI_MCA_mpi_show_mca_params=1HPC Cluster Execution
SLURM
Example SLURM submission script:
#!/bin/bash
#SBATCH --job-name=tarang_sim
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=24:00:00
#SBATCH --partition=compute
# Load modules
module load julia/1.9
module load openmpi/4.1
# Set environment
export OMP_NUM_THREADS=1
export JULIA_NUM_THREADS=1
# Calculate total tasks
NTASKS=$((SLURM_NNODES * SLURM_NTASKS_PER_NODE))
# Run simulation
srun -n $NTASKS julia --project=. simulation.jlSubmit with:
sbatch submit_tarang.shPBS/Torque
Example PBS script:
#!/bin/bash
#PBS -N tarang_sim
#PBS -l nodes=4:ppn=32
#PBS -l walltime=24:00:00
#PBS -q batch
cd $PBS_O_WORKDIR
module load julia/1.9
module load openmpi/4.1
export OMP_NUM_THREADS=1
mpiexec julia --project=. simulation.jlSubmit with:
qsub submit_tarang.pbsLoad Balancing
Automatic Load Balancing
Tarang.jl automatically distributes work across MPI processes using PencilArrays. For balanced performance:
- Use power-of-2 process counts when possible (4, 8, 16, 32, ...)
- Match mesh to domain: Aspect ratio of mesh should match domain
- Consider memory: Each process needs enough RAM for its subdomain
Checking Load Distribution
Add diagnostics to your script:
rank = MPI.Comm_rank(MPI.COMM_WORLD)
size = MPI.Comm_size(MPI.COMM_WORLD)
local_size = size(T.data) # Size of local data on this rank
println("Rank $rank: Local array size = $local_size")Communication Patterns
PencilArrays Transposes
Tarang.jl uses PencilArrays for efficient data distribution. Key operations:
- FFTs: May require transposes between different pencil orientations
- Derivatives: Computed in spectral space (minimal communication)
- Nonlinear terms: Evaluated in grid space (may require transforms)
Minimizing Communication
To reduce communication overhead:
- Group operations: Batch multiple operations before synchronizing
- Use larger process counts: More processes = smaller messages
- Optimize process mesh: Align with dominant communication direction
Performance Monitoring
MPI Profiling
Use MPI profiling tools to identify bottlenecks:
# With mpiP
mpiexec -n 8 julia --project=. script.jl
# With Intel MPI
export I_MPI_STATS=20
mpiexec -n 8 julia script.jlTarang.jl Built-in Profiling
Enable performance logging:
using Tarang
# Setup logging with MPI awareness
Tarang.setup_tarang_logging(
level="INFO",
filename="performance.log",
mpi_aware=true,
console=true
)Timing Critical Sections
Add timing to your simulation:
using Printf
t_start = time()
# ... simulation code ...
t_end = time()
elapsed = t_end - t_start
if MPI.Comm_rank(MPI.COMM_WORLD) == 0
@printf "Total time: %.2f seconds\n" elapsed
endDebugging MPI Applications
Running in Serial
Test your code without MPI first:
julia script.jl # No mpiexecModify your script to use 1 process:
dist = Distributor(coords, mesh=(1, 1))MPI Debugging Tools
Use parallel debuggers:
# With TotalView
mpiexec -tv -n 4 julia script.jl
# With DDT (ARM Forge)
ddt mpiexec -n 4 julia script.jl
# With gdb (attach to rank 0)
mpiexec -n 4 xterm -e gdb julia script.jlRank-Specific Output
Debug specific MPI ranks:
rank = MPI.Comm_rank(MPI.COMM_WORLD)
if rank == 0
println("Debug: Rank 0 data = ", data)
end
# Or debug all ranks
println("Rank $rank: data = $data")Common Issues and Solutions
Deadlocks
Symptom: Program hangs without error
Causes:
- Mismatched collective operations
- Ranks waiting for different communications
- Incorrect synchronization
Solution: Ensure all ranks participate in collective operations:
# All ranks must call collective operations
MPI.Barrier(MPI.COMM_WORLD) # All ranks must call thisLoad Imbalance
Symptom: Some ranks finish much faster than others
Solution: Check domain decomposition and adjust mesh:
# Before (imbalanced for tall domain)
mesh=(8, 2)
# After (better for tall domain)
mesh=(4, 4)Memory Issues
Symptom: Out of memory errors on some ranks
Solution: Reduce resolution or increase process count:
# Reduce resolution
x_basis = RealFourier(coords["x"], size=512, ...) # was 1024
# Or use more processes to distribute memory
mesh=(8, 8) # was (4, 4)Wrong Number of Processes
Symptom: ERROR: Process count mismatch
Solution: Match mpiexec -n to mesh product:
mesh=(4, 2) # requires mpiexec -n 8Performance Tips
Optimal Process Count
- Start with square meshes:
mesh=(4,4),(8,8), etc. - Profile with different counts: Try 4, 8, 16, 32 processes
- Check scaling: Plot speedup vs. process count
- Consider communication: More processes = more communication overhead
Node-Level Optimization
On multi-socket nodes:
# Bind processes to cores
mpiexec -n 16 --bind-to core julia script.jl
# Use one process per socket
mpiexec -n 2 --map-by socket julia script.jlNetwork Optimization
For InfiniBand or high-speed networks:
# Enable UCX (if available)
export OMPI_MCA_pml=ucx
export OMPI_MCA_osc=ucx
mpiexec -n 32 julia script.jlBenchmarking
Weak Scaling
Keep local problem size constant, increase total size:
# 4 processes: 128×64 per process
mesh=(2, 2); x_size=256; z_size=128
# 16 processes: 128×64 per process
mesh=(4, 4); x_size=512; z_size=256
# 64 processes: 128×64 per process
mesh=(8, 8); x_size=1024; z_size=512Ideal weak scaling: time remains constant.
Strong Scaling
Keep total problem size constant, increase processes:
# All use: x_size=1024, z_size=512
mesh=(2, 2) # 4 processes
mesh=(4, 4) # 16 processes
mesh=(8, 8) # 64 processesIdeal strong scaling: time decreases linearly with processes.
Next Steps
- First Steps: Basic Tarang.jl workflow
- Tutorials: Example simulations with MPI
- Configuration: Advanced MPI configuration options
- Parallelism Guide: Deep dive into parallel algorithms