Bigger, cleaner, faster

notes on the implementation of a powerful, flexible, high-performance SPH computational engine



First implementation of WCSPH to run entirely on GPU with CUDA.
Version 5 released on June 13, 2019.

Aims to be a universal SPH computational engine.



Simulation framework

Too many options?

4×4×4×5×32×10×4×3×3×2×212=222×34×52= 4 \times 4 \times 4 \times 5 \times 3^2 \times 10 \times 4 \times 3 \times 3 \times 2 \times 2^12 = 2^22 \times 3^4 \times 5^2 = {} =84934656008.5×109 {} = 8493465600 \simeq 8.5\times 10^9

potential simulation framework combinations.

And it keeps growing!
(e.g. not included: thermal and thermo-dynamic problems)

Not all combinations are currently supported. We’re working on it!


no performance penalty from unused options
runtimes without a given option should be the same as if support for that option had not been implemented at all;
minimize implementation complexity
fighting against Tesler’s law of conservation of complexity;



Nested calls for the Navier–Stokes forces computation (zoom on laminar visc)
Nested calls for the Navier–Stokes forces computation (zoom on laminar visc)

Example #1: laminar visc contrib

Morris Kjωij2μijFijuij\phantom{K}\sum_j \omega_{ij} 2 \bar\mu_{ij} F_{ij} \vec u_{ij}
Monaghan Kjωij2μijFijrijuijrijrijrijK \sum_j \omega_{ij} 2 \bar\mu_{ij} F_{ij} \frac{\vec r_{ij}\cdot \vec u_{ij}}{\vec r_{ij} \cdot \vec r_{ij}} \vec r_{ij}
Español & Revenga KjωijFij((53μijζij)uij+5(13μij+ζij)rijuijrijrijrij)\phantom{K}\sum_j \omega_{ij} F_{ij} \left( \left(\frac{5}{3} \bar\mu_{ij} - \bar\zeta_{ij}\right) \vec u_{ij} + 5 \left( \frac{1}{3}\bar\mu_{ij} + \bar\zeta_{ij}\right) \frac{\vec r_{ij}\cdot \vec u_{ij}}{\vec r_{ij} \cdot \vec r_{ij}} \vec r_{ij} \right)

where rijFij=Wij\vec r_{ij} F_{ij} = \nabla W_{ij}, μij\bar\mu_{ij} and ζij\bar\zeta_{ij} are some average first and second viscosity,
and ωij\omega_{ij} is the volume-based neighbor weight.

How many specializations do we need?

Neighbor weight vs average visc

ωij=mjρiρj\omega_{ij} = \frac{m_j}{\rho_i \rho_j} except for Grenier ωij=ωj(1σi+1σj)\omega_{ij} = \omega_j\left(\frac{1}{\sigma_i} + \frac{1}{\sigma_j}\right).

Harmonic mean of constant kinematic viscosity: μij=2νρiρjρi+ρj\bar\mu_{ij} = 2\nu \frac{\rho_i \rho_j}{\rho_i + \rho_j}

When possible, simplify ωij2μij=4νmjρi+ρj\omega_{ij} 2 \bar\mu_{ij} = 4 \nu \frac{m_j}{\rho_i + \rho_j}.

We want a specialization for simulations with a single, Newtonian fluid when using kinematic internal viscosity representation with harmonic averaging, except when using Grenier’s formulation or the Español & Revenga viscous model.

Example #2: rheology

Rheological contribution to the viscosity
Yield strength Shear rate dependency
Linear Power-law Exponential
none Newton Power-law
constant Bingham Herschel–Bulkley DeKee & Turcotte
regularized Papanastasiou Alexandrou Zhu
functional Granular

Separate contributions and share common code: 4+3=74+3=7 functions instead of 99 (of up to 4×3=124\times3 = 12).

And no repeated code (DRY)!

Implementation: rheology (naive)

void shear_rate_contrib(...) {
    if (linear_rheology(...)) { /* linear contrib */ }
    else if (power_law_rheology(...)) { /* power-law contrib */ }
    else if (exponential_rheology(...)) { /* exponential contrib */ }

void yield_strength_contrib(...) {
    if (no_yield_strength(...)) return;
    if (constant_ys(...)) { /* standard yield strength contrib */ }
    else if (regularized_ys(...)) { /* regularized yield strength contrib */}
    else if (functional_ys(...)) { /* compute yield strength and apply contrib */ }

void calc_visc(...) {

What’s wrong with that?

Compiler may or may not see all optimization opportunities (e.g. dead code elimination).

Compiler may or may not minimize use of variables (more variables = more registers = register spills = slower code).

Trust the compiler? 🙏

Help the compiler! 🏄

What we want (rheology)

(...) shear_rate_contrib(...) { /* linear contrib */ }
(...) shear_rate_contrib(...) { /* power-law contrib */ }
(...) shear_rate_contrib(...) { /* exponential contrib */ }

(...) yield_strength_contrib(...) { return; /* no yield strength */ }
(...) yield_strength_contrib(...) { /* standard yield strength contrib */ }
(...) yield_strength_contrib(...) { /* regularized yield strength contrib */}
(...) yield_strength_contrib(...) { /* compute yield strength and apply contrib */ }

void calc_visc(...) {




template<KernelType kerneltype> float W(float r, float slength);

template<> float W<QUARTIC>(float r, float slength)
{ /* compute and return QUARTIC kernel value */ }

template<> float W<CUBIC>(float r, float slength)
{ /* compute and return CUBIC kernel value */ }

/* etc for the other values of KernelType */

/* same mechanism also used for F */

Compile-time equivalent of a switch / case statement. Selections based on a single option value.

What about more complex conditionals?


/* A type that is T if B is true, invalid otherwise.
 * Pre-defined in C++14, can be defined in C++11.
 * Older versions can define enable_if<B, T>::type
enable_if_t< bool B, typename T = void>

Requirement: B must be evaluable at compile-time:

Substitution Failure Is Not An Error ⇒ when enable_if fails, the corresponding specialization gets skipped without errors.


constexpr bool linear_rheology(RheologyType rheologytype)
{ return /* conditions for which a rheology is linear */ }

constexpr bool power_law_rheology(RheologyType rheologytype)
{ return /* conditions for a power-law rheology */ }

template<RheologyType rheologytype>
enable_if_t< linear_rheology(rheologytype) > /* skipped if not linear */
{ /* linear contrib */ }

template<RheologyType rheologytype>
enable_if_t< power_law_rheology(rheologytype) > /* skipped if not power-law */
{ /* power-law contrib */ }

Conditional structures

Structures with optional members, only present if a condition is satisfied.

struct effvisc_params {
    float *effvisc;
    effvisc_params(float *effvisc_) : effvisc(effvisc_) {}

template<typename Framework,
    typename optional_effvisc = cond_struct< needs_effvisc<Framework>(), effvisc_params >
struct all_params : optional_effvisc
    float *forces;

    all_params(...) :

Conditional structures/2


// Standard struct that maps type to T if B, else to F
template<bool B, typename T, typename F> struct conditional;

// A struct with no members to “gobble” any other type
struct empty_struct {
    empty_struct() {} /* empty constructor */

    // Universal constructor
    template<typename ...T>
    empty_struct(T const&...) {}

// type alias template for C++11; for older versions of the standard
// one can use a less robust COND_STRUCT(B, T) macro with heavier syntax 
template<bool B, typename T>
using cond_struct = conditional<B, T, empty_struct<T>>::type;

A(n almost) real example

template<KernelType _kerneltype, SPHFormulation _sph_formulation,
    DensityDiffusionType _densitydiffusiontype, /* more framework params omitted */
    flag_t _simflags,
    ParticleType _cptype, ParticleType _nptype,
    /* auxiliary defines for some conditionals, omitted */
    typename xsph_cond = cond_struct< (_simflags & ENABLE_XSPH) && _cptype == _nptype,
    typename vol_cond = cond_struct<
        _sph_formulation == SPH_GRENIER && _densitydiffusiontype == COLAGROSSI,
    typename grenier_cond =
        cond_struct< _sph_formulation == SPH_GRENIER, grenier_forces_params>,
    /* more conditional structures definitions, omitted */
struct forces_params : // the type template that will be passed to the computational kernels
    // empty_struct must be template or multiple missing optional substructures would error
    xsph_cond, vol_cond, grenier_cond,
    /* more conditional structures omitted */
    /* see next slide */

A(n almost) real example/2

template<...> struct forces_params : ... {
 // allow compile-time extraction of template parameters from the struct
 static constexpr KernelType kerneltype = _kerneltype;
 static constexpr SPHFormulation sph_formulation = _sph_formulation;
 /* etc */
 forces_params(...) : common_forces_params(...), xpsh_cond(...), vol_cond(...), ... {}

template<typename FP /* any instance of forces_params */>
void forcesDevice(FP params) { // local variables are held in conditional structures too!
 particle_data<FP> pdata(params); // central particle data
 particle_output<FP> pout(params, pdata); // particle output
 for_each_neib(FP::nptype, ...) { // iterate over neighbors of type nptype
   neib_data<FP> ndata(params, pdata); // neighbor data
   neib_output<FP> nout(params, pdata, ndata); // neighbor contribution
   /* function that abstracts the logic of the interaction, that calls other function
    * that handle specific parts, etc down the final specializations of each contribution
   particle_particle_interaction(params, pdata, ndata, pout, nout);
 write_out(params, pout); // save the results for this particle

A(n almost) real example/3

/// INVISCID viscous model
/*! No actual contribution */
template<typename FP, typename P, typename N, typename OP, typename ON>
enable_if_t< FP::inviscid > compute_laminar_visc_contrib
(FP const& params, P const& pdata, N const& ndata, OP &pout, ON &nout)
{ /* do nothing */ }

/// Standard volumic viscous contribution
/*! This is used when the neighbor is not a boundary particle, or for dynamic boundaries,
 *  but not with Grenier's formulation or Español & Revenga's viscous model
template<typename FP, typename P, typename N, typename OP, typename ON>
enable_if_t< (not FP::inviscid) && wants_volumic_visc_term<FP>() &&
    FP::viscmodel != ESPANOL_REVENGA && FP::sph_formulation != SPH_GRENIER >
(FP const& params, P const& pdata, N const& ndata, OP &pout, ON &nout)
    const float visc = visc_avg<FP::ViscSpec>(
            get_visc_coeff(pdata), get_visc_coeff(ndata),
            physical_density(pdata), physical_density(ndata), ndata.relPos.w);
    nout.DvDt += visc*ndata.f*viscous_vector_component(params, pdata, ndata);

/* and several other specializations */

Split neighbors

forces_params (and thus forcesDevice) have the central and neighbor particle type as template parameters ⇒ individual specializations per pair of particle types ⇒ no runtime conditionals based on neighbor type.

Main structure is (mostly) the same for all pairs ⇒ actual specialization is typically only at or near the “bottom of the stack” (e.g. selected through wants_volumic_visc_term).

How do we iterate efficiently on neighbors of a single type?

Split neighbors/2

Top left CPU-preferred neighbors list layout
Top middle GPU-preferred neighbors list layout
Right GPU-preferred splitneibs list layout


Are we there yet?

the number of framework options is growing at an increasing rate;
much more solid base to work on (easier merge of some recently developed features);
improved runtimes (depending on workload and architectures; most complex configuration: over 30% faster on older hardware, around 18% faster on current GPUs);