SIMDLib

This directory contains the tinysimd library, a header only library. The library is a light, zero-overhead wrapper based on template meta-programming that automatically selects a SIMD intrinsics from the most specialized x86-64 instruction set extension available among SSE2 (limited support), AVX2 and AVX512 (not tested). The library is designed to be easily extended to other architectures.

To use the library one needs to import the tinysimd.hpp header. The type traits routines, needed for templated programming are available in the traits.hpp header. It is highly discouraged to perform IO with vector types. If IO is needed for debugging, one needs to import the io.hpp header.

To enable the vector types in Nektar++ you need to set to ON the desired extension (for instance NEKTAR_ENABLE_SIMD_AVX2). This will automatically set the appropriate compiler flags. However notice that currently these are set correctly only for gcc and that you might need to delete the cached variable CMAKE_CXX_FLAGS before configuring cmake. You can check that the desired vector extension was compiled properly by running the VecDataUnitTests which prints out the extension in use.

A realistic example: an example of a more realistic usage can be found in the SIMD version of the Vmath routines

1namespace SIMD
2{
3    /// \brief Multiply vector z = x + y
4    template<class T,typename = typename std::enable_if
5        <std::is_floating_point<T>::value>::type>
6    void Vadd(const size_t n, const T *x,  const T *y, T *z)
7    {
8        using namespace tinysimd;
9        using vec_t = simd<T>;
10
11        size_t cnt = n;
12        // Vectorized loop unroll 4x
13        while (cnt >= 4*vec_t::width)
14        {
15            // load
16            vec_t yChunk0, yChunk1, yChunk2, yChunk3;
17            yChunk0.load(y, is_not_aligned);
18            yChunk1.load(y + vec_t::width, is_not_aligned);
19            yChunk2.load(y + 2*vec_t::width, is_not_aligned);
20            yChunk3.load(y + 3*vec_t::width, is_not_aligned);
21
22            vec_t xChunk0, xChunk1, xChunk2, xChunk3;
23            xChunk0.load(x, is_not_aligned);
24            xChunk1.load(x + vec_t::width, is_not_aligned);
25            xChunk2.load(x + 2*vec_t::width, is_not_aligned);
26            xChunk3.load(x + 3*vec_t::width, is_not_aligned);
27
28            // z = x + y
29            vec_t zChunk0 = xChunk0 + yChunk0;
30            vec_t zChunk1 = xChunk1 + yChunk1;
31            vec_t zChunk2 = xChunk2 + yChunk2;
32            vec_t zChunk3 = xChunk3 + yChunk3;
33
34            // store
35            zChunk0.store(z, is_not_aligned);
36            zChunk1.store(z + vec_t::width, is_not_aligned);
37            zChunk2.store(z + 2*vec_t::width, is_not_aligned);
38            zChunk3.store(z + 3*vec_t::width, is_not_aligned);
39
40            // update pointers
41            x += 4*vec_t::width;
42            y += 4*vec_t::width;

Note that there are 2 loops, a vectorized loop and a spillover loop (which is used when the input array size is not a multiple of the vector width). For more complex methods the core of the loop is replaced by a call to a kernel that can accept both a vector type or a scalar type. In general the loops are characterized 3 sections: a load to local variables from the input arrays, a call to one or more kernels, a store from the local variables to the output arrays. The load and store operations need to specify the flag is_not_aligned if the referenced memory is not guaranteed to be aligned to the vector width boundaries. Otherwise a segmentation fault is just waiting to happen!

As an example of a method with a complex body with calls to multiple kernels refer to RoeSolverSIMD.cpp.

Usage with matrix free operators: the usage of the tineysimd library in the matrix free operators differs from the above due to the interleaving of n elements degree of freedoms (where n is the vector width) in a contiguous chunk of memory. You can refer to [53] for more details.

General optimization guidelines: a key factor to improve performance on modern architectures is to limit as much as possible data transfer from DRAM to cache

4.11 SIMDLib