Bigger, cleaner, faster

GPUSPH

First implementation of WCSPH to run entirely on GPU with CUDA.
Version 5 released on June 13, 2019. https://www.gpusph.org/

Aims to be a universal SPH computational engine.

Applications

geophysics, industry, coastal engineering, energy generation, …

Requirements

performance;

robustness;
Research in WCSPH

new formulations, more physics, …

Requirements

flexible, extensible code;

correctness;

Simulation framework

Smoothing kernel
- Quadratic
- Cubic spline
- Wendland
- Gaussian
SPH formulation
- WCSPH single-fluid
- WCSPH multi-fluid
- Grenier
- Hu & Adams
Density diffusion
- (none)
- Ferrari
- Molteni & Colagrossi
- Brezzi
Boundary model
- Lennard–Jones
- Monaghan–Kajtar
- dynamic
- semi-analytical
- dummy
Periodicity (any combination of)
- X
- Y
- Z
Rheological models
- Inviscid
- Newton
- Granular
- Bingham
- Papanastasiou
- Power-law
- Herschel–Bulkley
- Alexandrou
- DeKee & Turcotte
- Zhu
Turbulence model
- (none)
- $k$ - $\epsilon$
- Sub Particle Scale (SPS)
- artificial viscosity (yes, we know)
Viscous model
- Morris
- Monaghan
- Español & Revenga
Averaging operator
- arithmetic
- harmonic
- geometric
Internal viscosity representation
- Dynamic
- Kinematic
Miscellanea (boolean flags)
- Adaptive time-stepping
- XSPH
- Geometric planes support
- Geometric natural topography support
- Moving bodies support
- Open boundaries support
- Water depth computation
- Density summation
- Semi-analytical gamma quadrature
- Internal energy computation
- Multi-fluid support
- Implicit integration of the viscous term

Too many options?

$4 \times 4 \times 4 \times 5 \times 3^2 \times 10 \times 4 \times 3 \times 3 \times 2 \times 2^12 = 2^22 \times 3^4 \times 5^2 = {}$ ${} = 8493465600 \simeq 8.5\times 10^9$

potential simulation framework combinations.

And it keeps growing!
(e.g. not included: thermal and thermo-dynamic problems)

Not all combinations are currently supported. We’re working on it!

Objectives

no performance penalty from unused options: runtimes without a given option should be the same as if support for that option had not been implemented at all;
minimize implementation complexity: fighting against Tesler’s law of conservation of complexity;

Strategy

no performance penalty ✔
- do as much work as possible at compile time;
- prevent the compiler from even seeing the code that shouldn’t run in a given configuration (specialized versions of functions based on framework options);
- prevent the compiler from even seeing the variables that aren’t used in a given configuration (specialized types based on framework options);
minimize complexity ✔
- aim for simple/small functions (or function templates);
- logic abstraction;
- DRY (Don’t Repeat Yourself);
Tesler’s law ✘
- increased levels of indirection;
- spread out code;
- non-trivial programming techniques (e.g. metaprogramming);

Indirection

Nested calls for the Navier–Stokes forces computation (zoom on laminar visc)

Example #1: laminar visc contrib

Morris	$\phantom{K}\sum_j \omega_{ij} 2 \bar\mu_{ij} F_{ij} \vec u_{ij}$
Monaghan	$K \sum_j \omega_{ij} 2 \bar\mu_{ij} F_{ij} \frac{\vec r_{ij}\cdot \vec u_{ij}}{\vec r_{ij} \cdot \vec r_{ij}} \vec r_{ij}$
Español & Revenga	$\phantom{K}\sum_j \omega_{ij} F_{ij} \left( \left(\frac{5}{3} \bar\mu_{ij} - \bar\zeta_{ij}\right) \vec u_{ij} + 5 \left( \frac{1}{3}\bar\mu_{ij} + \bar\zeta_{ij}\right) \frac{\vec r_{ij}\cdot \vec u_{ij}}{\vec r_{ij} \cdot \vec r_{ij}} \vec r_{ij} \right)$

where $\vec r_{ij} F_{ij} = \nabla W_{ij}$ , $\bar\mu_{ij}$ and $\bar\zeta_{ij}$ are some average first and second viscosity,
and $\omega_{ij}$ is the volume-based neighbor weight.

How many specializations do we need?

Neighbor weight vs average visc

$\omega_{ij} = \frac{m_j}{\rho_i \rho_j}$ except for Grenier $\omega_{ij} = \omega_j\left(\frac{1}{\sigma_i} + \frac{1}{\sigma_j}\right)$ .

Harmonic mean of constant kinematic viscosity: $\bar\mu_{ij} = 2\nu \frac{\rho_i \rho_j}{\rho_i + \rho_j}$

When possible, simplify $\omega_{ij} 2 \bar\mu_{ij} = 4 \nu \frac{m_j}{\rho_i + \rho_j}$ .

We want a specialization for simulations with a single, Newtonian fluid when using kinematic internal viscosity representation with harmonic averaging, except when using Grenier’s formulation or the Español & Revenga viscous model.

Example #2: rheology

Rheological contribution to the viscosity
Yield strength	Shear rate dependency
	Linear	Power-law	Exponential
none	Newton	Power-law
constant	Bingham	Herschel–Bulkley	DeKee & Turcotte
regularized	Papanastasiou	Alexandrou	Zhu
functional	Granular

Separate contributions and share common code: $4+3=7$ functions instead of $9$ (of up to $4\times3 = 12$ ).

And no repeated code (DRY)!

Implementation: rheology (naive)

void shear_rate_contrib(...) {
    if (linear_rheology(...)) { /* linear contrib */ }
    else if (power_law_rheology(...)) { /* power-law contrib */ }
    else if (exponential_rheology(...)) { /* exponential contrib */ }
}

void yield_strength_contrib(...) {
    if (no_yield_strength(...)) return;
    if (constant_ys(...)) { /* standard yield strength contrib */ }
    else if (regularized_ys(...)) { /* regularized yield strength contrib */}
    else if (functional_ys(...)) { /* compute yield strength and apply contrib */ }
}

void calc_visc(...) {
    compute_shear_rate(...);
    shear_rate_contrib(...);
    yield_strength_contrib(...);
}

What’s wrong with that?

Compiler may or may not see all optimization opportunities (e.g. dead code elimination).

Compiler may or may not minimize use of variables (more variables = more registers = register spills = slower code).

Trust the compiler? 🙏

Help the compiler! 🏄

What we want (rheology)

(...) shear_rate_contrib(...) { /* linear contrib */ }
(...) shear_rate_contrib(...) { /* power-law contrib */ }
(...) shear_rate_contrib(...) { /* exponential contrib */ }

(...) yield_strength_contrib(...) { return; /* no yield strength */ }
(...) yield_strength_contrib(...) { /* standard yield strength contrib */ }
(...) yield_strength_contrib(...) { /* regularized yield strength contrib */}
(...) yield_strength_contrib(...) { /* compute yield strength and apply contrib */ }

void calc_visc(...) {
    compute_shear_rate(...);
    shear_rate_contrib(...);
    yield_strength_contrib(...);
}

one (version of the) function for each specialization, where the specialization may depend on sophisticated conditionals on the framework parameters;
no unnecessary arguments and variables, but same call syntax for all functions: same name, “same” parameters, “same” return type.

Templates

enum KernelType { QUARTIC, CUBIC, WENDLAND, GAUSSIAN };

template<KernelType kerneltype> float W(float r, float slength);

template<> float W<QUARTIC>(float r, float slength)
{ /* compute and return QUARTIC kernel value */ }

template<> float W<CUBIC>(float r, float slength)
{ /* compute and return CUBIC kernel value */ }

/* etc for the other values of KernelType */

/* same mechanism also used for F */

Compile-time equivalent of a switch / case statement. Selections based on a single option value.

What about more complex conditionals?

SFINAE

/* A type that is T if B is true, invalid otherwise.
 * Pre-defined in C++14, can be defined in C++11.
 * Older versions can define enable_if<B, T>::type
 */
enable_if_t< bool B, typename T = void>

Requirement: B must be evaluable at compile-time:

C++11: constexpr bool f(...) { return ... } with compile-time constant arguments;
Older: template<...> struct f { static const bool value = ... }

Substitution Failure Is Not An Error ⇒ when enable_if fails, the corresponding specialization gets skipped without errors.

SFINAE/2

constexpr bool linear_rheology(RheologyType rheologytype)
{ return /* conditions for which a rheology is linear */ }

constexpr bool power_law_rheology(RheologyType rheologytype)
{ return /* conditions for a power-law rheology */ }

template<RheologyType rheologytype>
enable_if_t< linear_rheology(rheologytype) > /* skipped if not linear */
shear_rate_contrib(...)
{ /* linear contrib */ }

template<RheologyType rheologytype>
enable_if_t< power_law_rheology(rheologytype) > /* skipped if not power-law */
shear_rate_contrib(...)
{ /* power-law contrib */ }

Conditional structures

Structures with optional members, only present if a condition is satisfied.

struct effvisc_params {
    float *effvisc;
    effvisc_params(float *effvisc_) : effvisc(effvisc_) {}
};

template<typename Framework,
    typename optional_effvisc = cond_struct< needs_effvisc<Framework>(), effvisc_params >
>
struct all_params : optional_effvisc
{
    float *forces;

    all_params(...) :
        forces(...),
        optional_effvisc(...)
    {}
};

Conditional structures/2

Framework:

// Standard struct that maps type to T if B, else to F
template<bool B, typename T, typename F> struct conditional;

// A struct with no members to “gobble” any other type
template<typename>
struct empty_struct {
    empty_struct() {} /* empty constructor */

    // Universal constructor
    template<typename ...T>
    empty_struct(T const&...) {}
};

// type alias template for C++11; for older versions of the standard
// one can use a less robust COND_STRUCT(B, T) macro with heavier syntax 
template<bool B, typename T>
using cond_struct = conditional<B, T, empty_struct<T>>::type;

A(n almost) real example

template<KernelType _kerneltype, SPHFormulation _sph_formulation,
    DensityDiffusionType _densitydiffusiontype, /* more framework params omitted */
    flag_t _simflags,
    ParticleType _cptype, ParticleType _nptype,
    /* auxiliary defines for some conditionals, omitted */
    typename xsph_cond = cond_struct< (_simflags & ENABLE_XSPH) && _cptype == _nptype,
        xsph_forces_params>,
    typename vol_cond = cond_struct<
        _sph_formulation == SPH_GRENIER && _densitydiffusiontype == COLAGROSSI,
        volume_forces_params>,
    typename grenier_cond =
        cond_struct< _sph_formulation == SPH_GRENIER, grenier_forces_params>,
    /* more conditional structures definitions, omitted */
>
struct forces_params : // the type template that will be passed to the computational kernels
    common_forces_params,
    // empty_struct must be template or multiple missing optional substructures would error
    xsph_cond, vol_cond, grenier_cond,
    /* more conditional structures omitted */
{
    /* see next slide */
};

A(n almost) real example/2

template<...> struct forces_params : ... {
 // allow compile-time extraction of template parameters from the struct
 static constexpr KernelType kerneltype = _kerneltype;
 static constexpr SPHFormulation sph_formulation = _sph_formulation;
 /* etc */
 forces_params(...) : common_forces_params(...), xpsh_cond(...), vol_cond(...), ... {}
};

template<typename FP /* any instance of forces_params */>
void forcesDevice(FP params) { // local variables are held in conditional structures too!
 particle_data<FP> pdata(params); // central particle data
 particle_output<FP> pout(params, pdata); // particle output
 for_each_neib(FP::nptype, ...) { // iterate over neighbors of type nptype
   neib_data<FP> ndata(params, pdata); // neighbor data
   neib_output<FP> nout(params, pdata, ndata); // neighbor contribution
   /* function that abstracts the logic of the interaction, that calls other function
    * that handle specific parts, etc down the final specializations of each contribution
    */
   particle_particle_interaction(params, pdata, ndata, pout, nout);
 }
 write_out(params, pout); // save the results for this particle
}

A(n almost) real example/3

/// INVISCID viscous model
/*! No actual contribution */
template<typename FP, typename P, typename N, typename OP, typename ON>
enable_if_t< FP::inviscid > compute_laminar_visc_contrib
(FP const& params, P const& pdata, N const& ndata, OP &pout, ON &nout)
{ /* do nothing */ }

/// Standard volumic viscous contribution
/*! This is used when the neighbor is not a boundary particle, or for dynamic boundaries,
 *  but not with Grenier's formulation or Español & Revenga's viscous model
 */
template<typename FP, typename P, typename N, typename OP, typename ON>
enable_if_t< (not FP::inviscid) && wants_volumic_visc_term<FP>() &&
    FP::viscmodel != ESPANOL_REVENGA && FP::sph_formulation != SPH_GRENIER >
compute_laminar_visc_contrib
(FP const& params, P const& pdata, N const& ndata, OP &pout, ON &nout)
{
    const float visc = visc_avg<FP::ViscSpec>(
            get_visc_coeff(pdata), get_visc_coeff(ndata),
            physical_density(pdata), physical_density(ndata), ndata.relPos.w);
    nout.DvDt += visc*ndata.f*viscous_vector_component(params, pdata, ndata);
}

/* and several other specializations */

Split neighbors

forces_params (and thus forcesDevice) have the central and neighbor particle type as template parameters ⇒ individual specializations per pair of particle types ⇒ no runtime conditionals based on neighbor type.

Main structure is (mostly) the same for all pairs ⇒ actual specialization is typically only at or near the “bottom of the stack” (e.g. selected through wants_volumic_visc_term).

How do we iterate efficiently on neighbors of a single type?

Split neighbors/2

Top left CPU-preferred neighbors list layout

Top middle GPU-preferred neighbors list layout

Right GPU-preferred *splitneibs* list layout

Top left	CPU-preferred neighbors	list layout
Top middle	GPU-preferred neighbors	list layout
Right	GPU-preferred splitneibs	list layout

Are we there yet?

Bigger: the number of framework options is growing at an increasing rate;
Cleaner: much more solid base to work on (easier merge of some recently developed features);
Faster: improved runtimes (depending on workload and architectures; most complex configuration: over 30% faster on older hardware, around 18% faster on current GPUs);