Small speed-up in row-major sparse dense product
This MR changes the implementation of `sparse_time_dense_product_impl` for RowMajor matrices to use two accumulation variables instead of one. This breaks up the dependency chain of the additions and gives the CPU an opportunity to exploit instruction-level parallelism.
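To illustrate the idea (this is a simplified sketch, not Eigen's actual code): the inner product of one sparse row with the dense vector is split across two independent accumulators, so the two chains of floating-point additions can execute in parallel instead of each addition waiting on the previous one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration of the two-accumulator pattern: dot product of a
// sparse row (index/value pairs) with a dense vector. acc0 and acc1 form two
// independent addition chains that the CPU can overlap.
float sparse_row_dot(const std::vector<int>& indices,
                     const std::vector<float>& values,
                     const std::vector<float>& rhs) {
    float acc0 = 0.0f;
    float acc1 = 0.0f;
    std::size_t n = values.size();
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += values[i] * rhs[indices[i]];
        acc1 += values[i + 1] * rhs[indices[i + 1]];
    }
    if (i < n) {
        acc0 += values[i] * rhs[indices[i]];  // odd leftover element
    }
    return acc0 + acc1;
}
```

Note that this changes the order of the floating-point additions, which is why the result can differ from the single-accumulator version in the last bits.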
I have run the following benchmark on my system (AMD Threadripper 1950, single-threaded):
```cpp
#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>

#include <Eigen/Sparse>

#include <chrono>
#include <random>
#include <vector>

void benchmark_sparse_dense_vector(ankerl::nanobench::Bench& bench) {
    const Eigen::Index SIZE = 100'000;
    const Eigen::Index NNZ = 10'000'000;
    Eigen::SparseMatrix<float, Eigen::RowMajor> matrix(SIZE, SIZE);

    std::default_random_engine rng;
    std::uniform_int_distribution<Eigen::Index> dist(0, SIZE - 1);
    std::vector<Eigen::Triplet<float>> triplets;
    triplets.reserve(NNZ);
    for (Eigen::Index i = 0; i < NNZ; ++i) {
        triplets.emplace_back(dist(rng), dist(rng), 1.0f);
    }
    matrix.setFromTriplets(begin(triplets), end(triplets));

    Eigen::VectorXf rhs = Eigen::VectorXf::Random(SIZE);
    Eigen::VectorXf dst(SIZE);
    bench.run("Sparse * Dense -> Dense", [&] { dst = matrix * rhs; });
}

int main() {
    ankerl::nanobench::Bench bench{};
    bench.warmup(10).epochs(100);
    bench.minEpochTime(std::chrono::nanoseconds{1'000'000});
    bench.minEpochIterations(10);
    benchmark_sparse_dense_vector(bench);
}
```
with the following results:
SIZE = 1M
New
| ns/op | op/s | err% | ins/op | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|
| 15,099,473.67 | 66.23 | 0.4% | 89,126,601.18 | 14,625,689.82 | 7.2% | 16.55 | Sparse * Dense -> Dense |
Old
| ns/op | op/s | err% | ins/op | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|
| 16,052,703.27 | 62.29 | 1.4% | 90,625,231.30 | 14,125,126.00 | 7.3% | 18.99 | Sparse * Dense -> Dense |
SIZE = 100k
New
| ns/op | op/s | err% | ins/op | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|
| 8,723,373.15 | 114.63 | 1.1% | 62,853,098.64 | 10,452,488.20 | 1.0% | 9.73 | Sparse * Dense -> Dense |
Old
| ns/op | op/s | err% | ins/op | bra/op | miss% | total | benchmark |
|---|---|---|---|---|---|---|---|
| 9,254,136.85 | 108.06 | 0.4% | 71,994,036.60 | 10,402,745.36 | 1.0% | 10.45 | Sparse * Dense -> Dense |
On the other hand, in my real application (which spends about 60% of its time on sparse-matrix × dense-vector products), I don't see any change. I suspect that is because I'm running many of these products in parallel, and the computation is severely memory-bound anyway.
Edited by Rasmus Munk Larsen