Performance Tips

Quick tips for reducing allocations and maximizing throughput

This page summarizes practical tips to reduce allocations and improve locality and throughput in SHTnsKit.jl, especially for distributed (MPI + PencilArrays) use.

  • Reuse plans: Construct the serial SHTPlan and the distributed plans (DistAnalysisPlan, DistSphtorPlan, DistQstPlan) once per problem size and reuse them across calls. Plans hold FFT plans and working buffers, so reuse avoids per-call allocations; see the recipes below.

  • Grid defaults: Gauss grids use phi_scale=:dft (FFT scaling by nlon); regular/Driscoll-Healy grids use phi_scale=:quad (nlon/2π). If you need a specific convention, override it globally with ENV["SHTNSKIT_PHI_SCALE"] = "dft" or "quad", or per config via phi_scale.
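
    A minimal sketch of both overrides; passing phi_scale as a keyword to create_gauss_config is an assumption about the constructor and may need adjusting:

    # Global default via environment variable (set before creating configs): "dft" or "quad"
    ENV["SHTNSKIT_PHI_SCALE"] = "quad"
    # Per-config override; phi_scale as a keyword here is an assumption
    cfg = create_gauss_config(32, 34; nlon=129, phi_scale=:quad)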

  • Low-allocation serial recipe: preallocate FFT scratch and outputs.

    using SHTnsKit

    cfg = create_gauss_config(32, 34; nlon=129)
    fft_scratch = scratch_fft(cfg)                       # reusable FFT workspace
    alm = zeros(ComplexF64, cfg.lmax+1, cfg.mmax+1)      # preallocated spectral output
    f   = randn(cfg.nlat, cfg.nlon)
    analysis!(cfg, alm, f; fft_scratch=fft_scratch)      # reuse alm + scratch
    f_out = scratch_spatial(cfg)                         # preallocated spatial output
    synthesis!(cfg, f_out, alm; fft_scratch=fft_scratch) # reuse f_out + scratch

  • Low-allocation distributed recipe: reuse plans with in-plan scratch.

    # proto is a prototype distributed (θ,φ) array (e.g. a PencilArray) that fixes sizes and layout;
    # cfg, Alm, fθφ, Vt, Vp, S, T are the config and preallocated distributed fields/coefficients.
    aplan = DistAnalysisPlan(cfg, proto; use_rfft=true)
    vplan = DistSphtorPlan(cfg, proto; use_rfft=true, with_spatial_scratch=true)
    splan = DistPlan(cfg, proto; use_rfft=true)
    dist_analysis!(aplan, Alm, fθφ)                               # no per-call FFT allocations
    dist_synthesis_sphtor!(vplan, Vt, Vp, S, T; real_output=true)
    dist_synthesis!(splan, fθφ, Alm; real_output=true)

    use_rfft switches to real-to-complex FFTs and trims the (θ,k) spectral grid; with_spatial_scratch keeps a single complex (θ,φ) buffer inside the vector/QST plans, so real outputs don’t allocate a fresh inverse-FFT workspace on each call.

  • use_rfft (distributed plans): When your PencilFFTs build supports real transforms, set use_rfft=true in the distributed plans (as in the recipe above) to cut the (θ,k) spectral memory and accelerate real-output paths. The code falls back to complex FFTs when real transforms are not supported.

  • with_spatial_scratch (vector/QST plans): Enable with_spatial_scratch=true to keep a single complex (θ,φ) scratch buffer in the plan; this removes per-call inverse-FFT allocations for real outputs. It defaults to off to minimize the memory footprint.

  • Precomputed Legendre tables: On fixed grids, call enable_plm_tables!(cfg) to precompute plm_tables and dplm_tables. They provide identical results to on-the-fly recurrences and usually reduce CPU cost.
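
    A minimal sketch, assuming the tables are stored on the config and picked up automatically by subsequent transforms:

    cfg = create_gauss_config(32, 34; nlon=129)
    enable_plm_tables!(cfg)                          # precompute plm_tables / dplm_tables once
    alm = zeros(ComplexF64, cfg.lmax+1, cfg.mmax+1)
    f   = randn(cfg.nlat, cfg.nlon)
    analysis!(cfg, alm, f)                           # later transforms on this grid reuse the tables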

  • Threading within a rank: For large lmax, enable Julia threads and, optionally, FFTW threads. Use set_optimal_threads!() to pick sensible defaults, or tune with set_threading!() and set_fft_threads() to match your core layout.
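
    For example, after starting Julia with threads (e.g. julia -t 8); only the zero-argument helper named above is shown, see the docstrings of set_threading! and set_fft_threads for manual tuning:

    using SHTnsKit
    set_optimal_threads!()    # match Julia/FFTW thread counts to the detected core layout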

  • LoopVectorization: If LoopVectorization.jl is available, analysis_turbo/synthesis_turbo and related helpers can accelerate the inner loops; guard these paths with using LoopVectorization so they are only taken when the package is loaded.
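
    A hedged sketch: the turbo helpers activate only when LoopVectorization is loaded, and the non-mutating signatures shown here are assumptions that may differ from the actual API:

    using SHTnsKit
    using LoopVectorization                      # loading this package enables the turbo paths
    cfg = create_gauss_config(32, 34; nlon=129)
    f   = randn(cfg.nlat, cfg.nlon)
    alm = analysis_turbo(cfg, f)                 # assumed to mirror a plain analysis(cfg, f)
    f2  = synthesis_turbo(cfg, alm)              # assumed to mirror a plain synthesis(cfg, alm)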

  • Data locality by m: Keep Alm distributed by m throughout your pipeline to avoid dense gathers. The distributed plans in this package consume and produce m-sliced data to preserve cache locality.