Traces Week 19 - 2023
After an inexcusably long hiatus, I wanted to share some insight into current development work on SSTT, which I hope to release by the end of the year under an open source license. After revisiting the large codebase, tediously migrating it to C++20, and getting a glimpse of C++23, one thing caught my eye - ranges! In C++23 the functional <ranges> API in the STL got a rather substantial upgrade. However, I tend to be a bit more rigorous before jumping on such a theme; after all, SSTT needs to be fast. As it turns out, beyond auto operator[](int, int, ...) - aka the multidimensional subscript operator - <mdspan> is a huge and seemingly complex piece of machinery that will eventually ship with future C++ toolchains.
With SSTT undergoing major surgery at its very foundation, especially to optimize internal data representations, std::mdspan and friends looked like a promising avenue to investigate. Let me share some of my insights.
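As a quick illustration of that operator (a minimal sketch with made-up names, not SSTT code): since C++23, operator[] may take more than one argument, so a 2D container can expose natural (row, column) access without proxy objects or operator() workarounds.

    #include <cstddef>
    #include <vector>

    // minimal sketch of a 2D view exposing the C++23 multidimensional subscript
    template <typename T>
    struct View2D {
        T* data;
        std::size_t rows, cols;
        // C++23 allows operator[] with more than one parameter
        T& operator[](std::size_t r, std::size_t c) { return data[r * cols + c]; }
    };

    int main() {
        std::vector<int> buf(4 * 4, 0);
        View2D<int> v{buf.data(), 4, 4};
        v[2, 3] = 42;   // multidimensional subscript, new in C++23
    }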
Implementation
My custom sstt::core::Array<> CRTP is everywhere in the code: it represents images, textures, auxiliary buffers, point clouds, and so on. It is a core system concept for a reason. And there are many good reasons to replace the old implementation, non-optimal element access being one of them.
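To give an idea of the general shape of such a class (a hypothetical sketch with made-up names, not the actual sstt::core::Array<> interface):

    #include <cstddef>
    #include <vector>

    // hypothetical sketch of a CRTP array facade; names and layout are made up
    template <typename Derived, typename T>
    struct ArrayBase {
        // generic element access, forwarded to the concrete storage class
        T& at(std::size_t r, std::size_t c) {
            return static_cast<Derived&>(*this).at_impl(r, c);
        }
    };

    template <typename T>
    struct Image : ArrayBase<Image<T>, T> {
        std::size_t width{}, height{};
        std::vector<T> data;
        Image(std::size_t w, std::size_t h) : width{w}, height{h}, data(w * h) {}
        T& at_impl(std::size_t r, std::size_t c) { return data[r * width + c]; }
    };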
First I set out to decode how std::mdspan actually converts spatial operations into a linear concept. Honestly, the available API examples are not yet very useful; it took me a while to decode the actual APIs of std::mdspan and std::submdspan. For this I used the single-header pre-release of the Kokkos <mdspan.hpp> (see References).
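What finally made it click for me: an mdspan is little more than a pointer, an extents object, and a layout policy whose mapping turns a multi-dimensional index into a linear offset, while submdspan produces a new (possibly strided) view over the same memory. A small sketch against the single-header pre-release (there the namespace is std::experimental; member names may still differ slightly from the final C++23 wording):

    #include <array>
    #include <cstddef>
    #include <utility>
    #include "mdspan.hpp"   // Kokkos single-header pre-release

    namespace stdex = std::experimental;

    int main() {
        std::array<float, 6 * 4> buf{};

        // 6x4 row-major view (layout_right): the mapping linearizes (i, j) as i * 4 + j
        stdex::mdspan m{buf.data(), 6, 4};
        auto off = m.mapping()(1, 1);   // == 1 * 4 + 1 == 5

        // submdspan carves out rows [1,3) and columns [1,4):
        // a strided view whose rows are no longer adjacent in memory
        auto sub = stdex::submdspan(m, std::pair{1, 3}, std::pair{1, 4});
        bool dense = sub.is_exhaustive();   // false for this slice

        (void)off; (void)dense;
    }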
Target Scenario
My goal was to replace and unify operations on contiguous and non-contiguous data (mostly images, but also other operations) in sstt::core::Array<>.
System
My test system was an i7-8700 (12 threads) with an NVIDIA GTX 1070 and 16 GiB RAM, running Arch Linux with gcc (GCC) 13.1.1 20230429 in release mode, so -O3 and -std=c++23 were on. No TBB, because for these kinds of operations I prefer single-threaded performance measurements: the overarching pipelining system in SSTT already schedules work across cores, and there is a lot going on, so putting additional CPU pressure on low-level operations is not that useful. However, proper vectorization and cache-line optimization can help a lot.
Measurements
Manual (is_contiguous() == true)

    benchmark: Arrays FullHD Mean    samples: 100    iterations: 1    estimated: 17.48 ms
    mean:      173.909 us   (low 172.452 us, high 176.058 us)
    std dev:   8.87194 us   (low 6.40105 us, high 11.9164 us)

mdspan (contiguous data, but no contiguity check)

    benchmark: Arrays FullHD Mean    samples: 100    iterations: 1    estimated: 111.946 ms
    mean:      1.11522 ms   (low 1.10776 ms, high 1.12618 ms)
    std dev:   45.5794 us   (low 33.6912 us, high 59.4841 us)
One reason seems to be cache misses (from a simple perf stat run):

    mdspan:  branch-misses:u   # 0.40% of all branches
    manual:  branch-misses:u   # 0.26% of all branches
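For context, the manual fast path in the first measurement is essentially a flat pass over the contiguous buffer, roughly along these lines (a sketch with a made-up function name, not the exact SSTT kernel); the mdspan variant is the cartesian_product based sum shown in the conclusion below.

    #include <cstddef>
    #include <cstdint>
    #include <numeric>

    // sketch of the manual contiguous fast path: once is_contiguous() holds,
    // the FullHD image is summed as one flat buffer
    std::uint64_t sum_contiguous(const std::uint8_t* data, std::size_t rows, std::size_t cols) {
        return std::accumulate(data, data + rows * cols, std::uint64_t{0});
    }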
Conclusion
Semantically, std::mdspan is an absolute beauty. Just look at this:
// wrapper around data structure
auto m = Matrix<uint8_t>{1920, 1080};

// lambda for sum
auto sum = [](auto s) {
    using T = decltype(s)::value_type;
    decltype(T() + T()) sum {0};
    const auto r = std::views::cartesian_product(
        std::views::iota(0u, s.extent(0)),  // u suffix because template
        std::views::iota(0u, s.extent(1))   // deduction sees a size_t
    );
    std::ranges::for_each(r.begin(), r.end(),
        [&s, &sum](auto idx) { auto [r, c] = idx; sum += s[r, c]; });
    return sum;
};

// lambda for mean
auto mean = [&](auto s) {
    return sum(s) / (s.extent(0) * s.extent(1));
};

// that's it ...
auto res = mean(m.span());

// ... and this is where the power of mdspan and submdspan comes in handy
auto res_roi = mean(m.view(Rectangle{Position{10, 20}, Size{16, 16}}).span());
It almost feels a bit like Rust. Yes, there are other tools such as std::accumulate from <numeric> to achieve this, but they know nothing about 2D or 3D arrays, and especially not about strided views into them (or sliced views, if you come from the HPC world). For now I can live with a small proxy template to branch out for non-contiguous data (views), as sketched below.
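Such a proxy can be as small as checking whether the view is exhaustive and taking the flat code path, otherwise walking the extents (a hypothetical sketch, 2D only, with made-up names; the real one in SSTT looks different):

    #include <cstddef>
    #include <numeric>

    // hypothetical proxy: flat pass for contiguous spans, per-element access otherwise
    template <typename Span>
    auto sum_of(Span s) {
        using T = typename Span::value_type;
        using Acc = decltype(T() + T());
        if (s.is_exhaustive()) {
            // contiguous: the layout covers the buffer without gaps, one flat pass suffices
            return std::accumulate(s.data_handle(),
                                   s.data_handle() + s.mapping().required_span_size(),
                                   Acc{0});
        }
        // strided view: walk both extents explicitly
        Acc sum{0};
        for (std::size_t r = 0; r < s.extent(0); ++r)
            for (std::size_t c = 0; c < s.extent(1); ++c)
                sum += s[r, c];
        return sum;
    }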
The overhead introduced by std::mdspan and friends is seemingly heavy, and I am not sure whether that is my fault or due to the preliminary implementation I am using. My approach is a hit-and-miss interpolation from various documentation resources and the scattered examples I found, and therefore by no means perfect. But my slightly informed guess is that the access over both extents creates quite a few cache-line misses with the small images typical for computer vision. So I am not closing the book on std::mdspan; it just does not seem to be a good fit for my particular purpose here. I have moved my implementation into a new sstt::core::experimental namespace to revisit later.
Over time, and as more people use it, better documentation and examples will pop up, paper cuts will be removed, and maybe I can then get rid of my custom handling of multidimensional data structures.
References
- https://github.com/kokkos/mdspan
- https://www.ashermancinelli.com/std-mdspan-tensors
- https://github.com/kokkos/mdspan/wiki/A-Gentle-Introduction-to-mdspan
- https://cppcast.com/too-cute-mdspan/
- this is where I understood how “strides” are “slices” in HPC land … https://youtu.be/aFCLmQEkPUw?t=1108